In ca409892c5396fa3fbb8ea4dbf53d0e952f36d09, frame indexes started
being treated more like registers, rather than immediates. Update
the commute logic to avoid failing the verifier by moving illegal
SGPR operands in place of a frame index.
This reverts commit 8a849a2a567d4e519b246a16936b6e7519936d4b.
It seems I missed a spot when trying to ensure the code in the
instruction selection tests were actually legalized MIR.
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.
Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
Atomic loads are handled differently from the DAG, and have separate opcodes
and explicit control over the extensions, like ordinary loads. Add
new patterns for these.
There's room for cleanup and improvement. d16 cases aren't handled.
Fixes#111645
The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp we need it to be able to split unsupported values.
It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y
equals to 32, the left-hand side is still x (unchanged), however, the
right-hand side will be evaluated to 0. So it is not always correct to
do such transformation.
We may be able to keep the pattern for immediate y while y is within [0,
31]. However, the immediate operands of the sub (32 - y) are easily
folded, and "(x << imm) >> imm" will be lowered to "and x,
(2^(32-imm))-1" anyway. So no bfe matching is needed.
The correct behaviour is to insert a readfirstlane. SelectionDAG was
already doing this in some cases, but not in the general case for chain
calls. GlobalISel was already doing this for return values but not for
arguments.
This did not consider the IR change to allow a scalar base with a vector
offset part. Reject any users that are not explicitly handled.
In this situation we could handle the vector GEP, but that is a larger
change. This just avoids the IR verifier error by rejecting it.
Even though the Attributor framework will invalidate all its dependent
AAs after the current iteration, a dependent AA can still use the worst
state of a depending AA if it doesn't check the state of the depending
AA in current iteration.
This only changes `llvm/lib/Target/AMDGPU/SIISelLowering.cpp`.
There are five uses of `std::tie` remaining because they can't be
replaced with
C++17 structured bindings.
For disassembler tables we use *V1_V4* variants for VIMAGE and then
remove unused vaddr fields. *V1_V1* variant, which has every vaddr
field other than vaddr0 set to 0, was also enabled and caused confusion
when decoding cases which used v0 (whose encoded value is 0)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.
This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.
The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.
Part of SWDEV-455845.
For image_atomic instructions with TFE, in some cases (e.g., when
dmask=3) the disassembler produces dst register with wrong size (e.g.,
image_atomic_smin v5, v1, s[8:15] dmask:0x3 tfe, instead of v[5:7]).
This patch fixes the VDataDwords values for image atomic instructions.
Tweak VOP2eInst_Base so that it does not rely on !eq comparing an int
value (-1) with a bits<5> value. This is to avoid a change in behaviour
when #112904 lands, which is a bug fix which has the side effect of
implicitly casting template arguments to the declared template parameter
type.
Replace unused analysis (VirtRegMap) dependency with the used one (SlotIndexes)
Initializes `SlotIndexesWrapperPass` which is used by SILowerSGPRSpills to ensure that legacy pass manager finds it.
Removes the initialization for `VirtRegMapWrapperPass` since it is not requested in this pass.
So far, isValidAddrSpaceCast only allows casts to the flat address
space and between the constant(32) address spaces. It does not allow
casting between the global and constant address spaces, even though they
alias. That affects, e.g., the lowering of memmoves from the constant to
the global address space in LowerMemIntrinsics, since that requires
aliasing address spaces to be castable.
This patch relaxes isValidAddrSpaceCast and allows such casts. It also
includes a memmove test that would crash with the previous
implementation because the memmove IR lowering would not be
applicable for the move from constant AS to global AS.
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's
been identified this message is only really needed for VGPR limited
kernels. A kernel becomes VGPR limited if a total number of VGPRs per
SIMD / number of used VGPRs is more than a number of wave slots.
IMPLICIT_DEF inserted for a wwm-register at the
very first block or the predecessor block where
it is used for sgpr spilling can appear at a block
begin that requires spill-insertion during per-lane
VGPR regalloc phase. The presence of the IMPLICIT_DEF
currently breaks the BB prolog.
Fixes: SWDEV-490717
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
In a variety of places we change the bitwidth of a parameter but don't
update the attributes.
The issue in this case is from the `range` attribute when inlining
`__memset_chk`. `optimizeMemSetChk` will replace an `i32` with an
`i8`, and if the `i32` had a `range` attr assosiated it will cause an
error.
Fixes#112633
This fixes all the places that hit the new assertion added in
https://github.com/llvm/llvm-project/pull/106524 in tests. That is,
cases where the value passed to the APInt constructor is not an N-bit
signed/unsigned integer, where N is the bit width and signedness is
determined by the isSigned flag.
The fixes either set the correct value for isSigned, set the
implicitTrunc flag, or perform more calculations inside APInt.
Note that the assertion is currently still disabled by default, so this
patch is mostly NFC.
fp conversion V_CVT_F_F/V_CVT_F_U instructions true16 format were
previously implemented using fake16 profile.
With the MC support inplace, correct and support these instructions in
true16/fake16 format in CodeGen
Modify VOP3 profile and pesudo, and add encoding info for VOP3 True16
including DPP and DPP8 in true16 and fake16 format.
This patch applies true16/fake16 changes and asm/dasm changes to
V_ADD_NC_U16
V_ADD_NC_I16
V_SUB_NC_U16
V_SUB_NC_I16