Recognize when sub-vectors have been split to elements which are used to
build large vector.
This happens when instructions have different vector sizes available.
For example a few arithmetic instruction are required to process all
elements of larger vector that can be stored using one instruction.
Differential Revision: https://reviews.llvm.org/D109242
Recognize when source could have been unmerged to pieces with DstTy
without having to split source to smaller elements
and then merge small elements into DstTy pieces.
This happens when vector was meant to be split to sub-vectors but there
was leftover. At this point artifact combiner have already dealt with
leftover and we can continue to use sub-vectors.
Differential Revision: https://reviews.llvm.org/D109241
Recognize copy that is represented as split of a source register to
elements that were reassembled to another register with the same type.
Differential Revision: https://reviews.llvm.org/D109240
These tests had been commented out but seem to not be crashing.
Not sure if codegen is perfect in each of them, but even if it's not I think it's better to put a TODO to fix codegen than remove the test outright, unless codegen is plain wrong (then I'd still rather XFAIL rather than hide it)
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D136341
In https://github.com/llvm/llvm-project/issues/57452, we found that IRTranslator is translating `i1 true` into `i32 -1`.
This is because IRTranslator uses SExt for indices.
In this fix, we change the expected behavior of extractelement's index, moving from SExt to ZExt.
This change includes both documentation, SelectionDAG and IRTranslator.
We also included a test for AMDGPU, updated tests for AArch64, Mips, PowerPC, RISCV, VE, WebAssembly and X86
This patch fixes issue #57452.
Differential Revision: https://reviews.llvm.org/D132978
Reference: https://gcc.gnu.org/onlinedocs/gccint/Machine-Constraints.html
k: A memory operand whose address is formed by a base register and
(optionally scaled) index register.
m: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as st.w and ld.w.
ZB: An address that is held in a general-purpose register. The offset
is zero.
ZC: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as ll.w and sc.w.
Note:
The INLINEASM SDNode flags in below tests are updated because the new
introduced enum `Constraint_k` is added before `Constraint_m`.
llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-inline-asm.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
llvm/test/CodeGen/X86/callbr-asm-kill.mir
This patch passes `ninja check-all` on a X86 machine with all official
targets and the LoongArch target enabled.
Differential Revision: https://reviews.llvm.org/D134638
Similar to the current "Trunc/BuildVector" folding - which folds low element extracts of BuildVectors, folds hi element extracts done using bitshifts.
For D134354
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D135148
Previously we would be unable to legalize V2S16 BUILD_VECTOR_TRUNC on GFX8 & below as the custom legalization was missing.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D135149
If we can not prove that f16 operands of a buildvector are canonicalized, then we can not lower into a V_PACK. In this scenario, we would previously lower into some combination of and(sdwa), shr, or. This patch allows for matching into V_PERM instead.
Change-Id: Ifa4a74fdb81ef44f22ba490c7fdf81ec8aebc945
Recognize more opcodes in the function.
Fixes some regressions introduced in D134857 for fdiv.f16 too.
Depends on D134857
Reviewed By: arsenm, foad
Differential Revision: https://reviews.llvm.org/D134862
Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.
This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134433
Given something like this:
```
declare signext i16 @signext_callee()
define i32 @caller() {
%res = call i16 @signext_callee()
...
}
```
CallLowering would miss that signext_callee's return value is sign extended,
because it isn't on the call.
Use hasRetAttr on the CallBase to allow us to catch this.
(This now inserts G_ASSERT_SEXT/G_ASSERT_ZEXT like in the original review.)
Differential Revision: https://reviews.llvm.org/D86228
The full complement of physical VGPRs for GFX11 is 50% more than GFX10.
Some subtargets have this, others stay the same as GFX10. This affects
occupancy calculations.
Differential Revision: https://reviews.llvm.org/D134522
Remove manual selection for atomic fadd from global-isel.
Stop pre-isel translation to AtomicLoadFAdd/G_ATOMICRMW_FADD
which corresponds to llvm-ir's atomicrmw fadd instruction.
global and flat atomic fadd patterns changes:
Split rtn/no-rtn patterns
Add missing patterns or fix predicates
Remove atomicrmw patterns for v2f16 (atomic rmw doesn't support vectors).
Patterns now check addrspace of pointer, added patterns for flat intrinsic.
with global addrspace pointer that selects into global atomic instruction.
buffer atomic fadd patterns changes:
Rdit patterns to import into global-isel.
Remove gfx6/gfx7 _addr64 and _offset patterns.
Remove patterns that can't be reached (same pattern but different feature).
Differential Revision: https://reviews.llvm.org/D130579
Use same atomicrmw fadd expansion rules for gfx908, gfx940 and gfx11
as for gfx90a. Add missing globalisel legalizer support for flat
atomicrmw fadd f32 on gfx940 and gfx11.
Isel support for gfx11 will be added in D130579.
Differential Revision: https://reviews.llvm.org/D131560
Precommit for D130579 that will remove manual selection and use
patterns from td files. Tests are grouped based on target features.
All patterns have rtn and no-rtn versions.
buffer atomics patterns are selected based on the intrinsic used
(raw or struct) and the offset operand (imm or vgpr):
_offset raw with imm offset
_offen raw with vgpr offset (or large imm offset)
_idxen struct with imm offset
_bothen struct with vgpr offset (or large imm offset)
global and flat atomics are selected via intrinsic or the atomicrmw fadd.
atomicrmw tests have amdgpu-unsafe-fp-atomics=true and non-system scope
since they get expanded otherwise. atomicrmw fadd does not support vector
type, test float and double.
global atomics patterns are selected based on address type via (global or
flat) intrinsic or atomicrmw fadd with global address(addrspace(1)*).
'no suffix' vgpr addrspace(1)* address
_saddr sgpr addrspace(1)* address
flat atomics patterns are selected via (flat)intrinsic or atomicrmw fadd
with flat address (* - address space 0).
Differential Revision: https://reviews.llvm.org/D131561
Due to the encoding changes in GFX11, we had a hack in place that
disables the use of VGPRs above 128. This patch removes the need for
that hack.
We introduce a new register class VGPR_32_Lo128 which is used for 16-bit
operands of VOP1, VOP2, and VOPC instructions. This register class only has the
low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1,
VOP2, and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.
We introduce new pseduo-instructions used on GFX11 which have the suffix
t16 (True 16) to use the VGPR_32_Lo128 register class.
Reviewed By: foad, rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D133723
Only do this for 16 and 32 register tuples, although we might want to
extend to 8 tuples.
It's incredibly expensive to spill these, and doing so majorly
interferes with the ability to allocate anything else in the function.
The lit tests show mostly sizeable improvements with a handful of tiny
regressions with large vectors.
Simplify extended add/sub (with carry-in and carry-out) to add/sub with
carry (with carry-out only) if carry-in is known to be zero.
Differential Revision: https://reviews.llvm.org/D133702
We use _oneuse checks to make sure combines won't accidentally
increase code size, but this prevents the optimization in cases where
we happen to want to clamp multiple values to the same range
It's safe to drop these checks for two reasons:
1. The pattern of max/min operations for med3 is complicated enough
it's unlikely to come up by accident, so this will still only fire
when appropriate to do so
2. Even if every intermediate is used and we don't save a single
operation, we still won't end up with more operations since the
med3 replaces the final max/min.
In pathological cases we could potentially end up with a larger
encoding size or possibly slightly increased vgpr pressure, but the
risk of that is low, especially considering the upside.
Differential Revision: https://reviews.llvm.org/D132621
Allows things like `(G_PTR_ADD (G_PTR_ADD a, b), c)` to be
simplified into a single ADD3 instruction instead of two adds.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D131254
Reland commit 719658d078c4
The base RA support infrastructure that only allow a specific register
class be allocated in RA pss. Since greedy RA, basic RA derived from
base RA, they all allow allocating specific register class. Fast RA
doesn't support allocating register for specific register class. This
patch is to enable ShouldAllocateClass in fast RA, so that it can
support allocating register for specific register class.
Differential Revision: https://reviews.llvm.org/D131825
si-annotate-control-flow does depth first traversal of BB's of
a function to insert amdgcn if intrinsics for conditional
branches so that isel can generate correct instructions later.
si-annotate-control-flow checks whether the successor BB for the 'else'
branch of a conditional branch has been visited. If it has been
visited, si-annotate-control-flow assumes the conditional
branch has been handled and will not try to insert if intrinsic
for it.
This assumption is not correct when the IR contains multiple
unreachable BB's. Then 'if' intrinscs are not inserted and incorrect
ISA are generated.
This patch fixes the issue by let amdgpu-unify-divergent-exit-nodes
unify unreachables even if they are uniformly reached. In this way
the IR will not contain multiple exits, and structurizer is able to
structurize the IR containing one unified exit.
Reviewed by: Ruiling Song, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D131181
Fixes: SWDEV-343244
By not clustering loads and adjusting heuristics to more aggressively reduce
register pressure we may be able to increase occupancy for the function if it
was dropped in a first pass scheduling.
Similarly, try to reduce spilling if register usage exceeds lower bound
occupancy.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130329
Edit the IR input for some codegen tests to simulate what the IR code
sinking pass would do to it. This makes the tests immune to the presence
or absence of the code sinking pass in the codegen pass pipeline, which
does not belong there.
Differential Revision: https://reviews.llvm.org/D130169
This patch merges a consecutive sequence of
s_or_saveexec s_o, s_i
s_xor exec, exec, s_o
into a single
s_andn2_saveexec s_o, s_i instruction.
This patch also cleans up the SIOptimizeExecMasking pass a bit.
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D129073
Implement an intrinsic for use lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.
There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.
It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.
The intent is to emit a __const array of LDS addresses and index into it.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D125060
This was stored in LiveIntervals, but not actually used for anything
related to LiveIntervals. It was only used in one check for if a load
instruction is rematerializable. I also don't think this was entirely
correct, since it was implicitly assuming constant loads are also
dereferenceable.
Remove this and rely only on the invariant+dereferenceable flags in
the memory operand. Set the flag based on the AA query upfront. This
should have the same net benefit, but has the possible disadvantage of
making this AA query nonlazy.
Preserve the behavior of assuming pointsToConstantMemory implying
dereferenceable for now, but maybe this should be changed.
SelectionDAG has a target hook, getExtendForAtomicOps, which it uses
in the computeKnownBits implementation for ATOMIC_LOAD. This is pretty
ugly (as is having a separate load opcode for atomics), so instead
allow making use of atomic zextload. Enable this for AArch64 since the
DAG path defaults in to the zext behavior.
The tablegen changes are pretty ugly, but partially helps migrate
SelectionDAG from using ISD::ATOMIC_LOAD to regular ISD::LOAD with
atomic memory operands. For now the DAG emitter will emit matchers for
patterns which the DAG will not produce.
I'm still a bit confused by the intent of the isLoad/isStore/isAtomic
bits. The DAG implementation rejects trying to use any of these in
combination. For now I've opted to make the isLoad checks also check
isAtomic, although I think having isLoad and isAtomic set on these
makes most sense.