Remove the MIR scan to detect whether AGPRs are used or not,
and the special case for callable functions. This behavior was
confusing, and not overridable. The amdgpu-no-agpr attribute was
intended to avoid this imprecise heuristic for how many AGPRs to
allocate. It was also too confusing to make this interact with
the pending amdgpu-num-agpr replacement for amdgpu-no-agpr.
Also adds an xfail-ish test where the register allocator asserts
after allocation fails which I ran into.
Future work should reintroduce a more refined MIR scan to estimate
AGPR pressure for how to split AGPRs and VGPRs.
SETULT can be an unsigned less than integer compare or a unordered less
than FP compare. We need to check the VT to distinguish them.
Fixes on of the issues from #128237.
Remove the restriction when reusing wider BROADCAST_LOAD nodes that both nodes couldn't have uses of their load chains - use makeEquivalentMemoryOrdering to merge the chains instead.
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote
This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
`N` may get merged with existing nodes inside the loop. Early exit when
it is deleted to avoid the crash.
Alternative solution: use `DAGNodeDeletedListener` to refresh the value
of N.
Closes https://github.com/llvm/llvm-project/issues/128143.
T16D16 table is implemented in
https://github.com/llvm/llvm-project/pull/127673
this is a follow up patch to add load/store pseudo for:
flat_store
global_load/global_store
scratch_load/scratch_store
in true16 mode and updated the codegen test file
This reverts commit 34167f99668ce4d4d6a1fb88453a8d5b56d16ed5.
Different set of verifier errors appears after other regalloc failure
tests with EXPENSIVE_CHECKS.
Similar to what we already do for VPERMV3 nodes - if we're trying to create a new unary variable shuffle and we started with a VPERMV node then always create a new one if it reduces the shuffle chain depth
In some cases after reporting an allocation failure, this would fail
the verifier. It picks the first allocatable register and assigns it,
but didn't update the liveness appropriately. When VirtRegRewriter
relied on the liveness to set kill flags, it would incorrectly add
kill flags if there was another overlapping kill of the virtual
register.
We can't properly assign the register to an overlapping range, so
break the liveness of the failing register (and any other interfering
registers) instead. Give the virtual register dummy liveness by
effectively deleting all the uses by setting them to undef.
The edge case not tested here which I'm worried about is if the read
of the register is a def of a subregister. I've been unable to come up
with a test where this occurs.
https://reviews.llvm.org/D122616
There's no need for a AVX1-only 32/64-bit scalar size limit - if the X86ISD::VBROADCAST node type is supported, X86ISD::VBROADCAST_LOAD will be as well.
#117700 made a change from analyzing all the candidates to analyzing
just the first candidate before deciding to either delete or keep all of
them.
Even though the candidates all have the same instructions, the basic
blocks in which they are present are different and we will need to check
each of them before deciding whether to keep or erase them.
Particularly, `isAvailableAcrossAndOutOfSeq` checks to see if the
register (x5 in this case) is available from the end of the MBB to the
beginning of the candidate and not checking this for each candidate led
to incorrect candidates being outlined resulting in correctness issues
in a few downstream benchmarks.
Similarly, deleting all the candidates if the first one is not viable
will result in missed outlining opportunities.
Remove load and store instructions which do not include an immediate,
and just use the immediate variants in all cases. These variants will be
emitted exactly the same when the immediate offset is 0. Removing the
non-immediate versions allows for the removal of a lot of code and would
make any MachineIR passes simpler.
If we have a shuffle mask which can be represented as two slides + some
conditional masking, we can emit a VLA sequence which is at most
O(2*LMUL). This is essentially a generalization of the existing
isElementRotate, but is staged to only introduce the new match for the
moment. A follow up change will start consolidating code - see the notes
below.
A couple of notes:
1) I'm excluding bit rotates mostly to keep the diffs manageable.
2) The existing isElementRotate logic is nearly redundant after this
change. However, we have some intersection between the bit rotate
and element rotate matching. To keep things simple, I left that in
place for now, and will merge/cleanup in a separate change.
3) The individual asVSlideup and asVSlidedown are closely related, but
the former looks through extracts and the later changes VL. I'm leaving
these in place for now, but hope to common them up a bit as well.
The SDAG version uses fminnum/fmaxnum, in converting it to fcmp+select
it appears the order of the operands was chosen badly. This switches the
conditions used to keep the constant on the RHS.
If we're splatting a subvector using a insert_subvector(insert_subvector(undef,sub,0),sub,c) pattern then permit multiuse of the sub as long as the insert_subvector nodes are the only users.
Despite them being more expensive than other variable mask shuffles, we
were combining shuffle chains to VPERM2W/VPERM2B if any shuffle in the
chain was a variable shuffle - including very cheap shuffles like PSHUFB
or AND mask patterns.
This patch adjusts the BWI VPERMV3 threshold - it still always permits
the merge if the chain (of 2 or more shuffles) contains any
X86ISD::VPERMV/VPERMV3 shuffles (including DQ variants), but otherwise
only reduces the depth threshold based off the number of other variable
shuffles we'd fold away.
Shaders that use the llvm.amdgcn.init.whole.wave intrinsic need to
explicitly preserve the inactive lanes of VGPRs of interest by adding
them as dummy arguments. The code usually looks something like this:
```
define amdgcn_cs_chain void f(active vgpr args..., i32 %inactive.vgpr1, ..., i32 %inactive.vgprN) {
entry:
%c = call i1 @llvm.amdgcn.init.whole.wave()
br i1 %c, label %shader, label %tail
shader:
[...]
tail:
%inactive.vgpr.arg1 = phi i32 [ %inactive.vgpr1, %entry], [poison, %shader]
[...]
; %inactive.vgpr* then get passed into a llvm.amdgcn.cs.chain call
```
Unfortunately, this kind of phi node will get optimized away and the
backend won't be able to figure out that it's ok to use the active lanes
of `%inactive.vgpr*` inside `shader`.
This patch fixes the issue by introducing a llvm.amdgcn.dead intrinsic,
whose result can be used as a PHI operand instead of the poison. This
will be selected to an IMPLICIT_DEF, which the backend can work with.
At the moment, the llvm.amdgcn.dead intrinsic works only on i32 values.
Support for other types can be added later if needed.
This is the follow up to #125026 that keeps mask operands in virtual
register form for as long as possible throughout the backend.
The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer
kicking in.
The invariant that the mask COPY never has a subreg no longer holds
after MachineCSE (it coalesces some copies), so it needed to be relaxed.
Currently if a user of an instruction isn't a vector pseudo we bail. For
simple non-subreg virtual COPYs, we can peek through their uses by using
a worklist.
This is extracted from a loop in TSVC2 (s273) that contains a fcmp +
select, which produces a copy that doesn't seem to be coalesced away.
This reverts commit 36eaf0daf5d6dd665d7c7a9ec38ea22f27709fed.
This is not a sound approach to dealing with this instruction change.
The new behavior is a different opcode pair, not a modifier on the
existing opcode.
This patch adds codegen for all Arm9.6-a compare-and-branch
instructions, that operate on full w or x registers. The instruction
variants operating on half-words (cbh) and bytes (cbb) are added in a
subsequent patch.
Since CB doesn't use standard 4-bit Arm condition codes but a reduced
set of conditions, encoded in 3 bits, some conditions are expressed by
modifying operands, namely incrementing or decrementing immediate
operands and swapping register operands. To invert a CB instruction it's
therefore not enough to just modify the condition code which doesn't
play particularly well with how the backend is currently organized. We
therefore introduce a number of pseudos which operate on the standard
4-bit condition codes and lower them late during codegen.
Tests have been re-generated with recent scheduler changes.
Original message:
SelectionDAG will not reassociate adds to the end of a chain if
there are multiple users of later additions. This prevents isel
from folding the immediate into a load/store address.
One easy way to see this is accessing an array in a struct with
two different indices. An ADDI will be used to get to the start
of the array then 2 different SHXADD instructions will be used to
add the scaled indices. Finally the SHXADD will be used by different
load instructions. We can remove the ADDI by folding the offset into
each load.
This patch adds a new pass that analyzes how an ADDI constant
propagates through address arithmetic. If the arithmetic is only
used by a load/store and the offset is small enough, we can adjust
the load/store offset and remove the ADDI.
This pass is placed before MachineCSE to allow cleanups if some
instructions become common after removing offsets from their inputs.
This pass gives ~3% improvement on dynamic instruction count on
541.leela_r and 544.nab_r from SPEC2017 for the train data set. There's
a ~1% improvement on 557.xz_r.