We already have patterns for matching fadd(select(..., -0.0)),
but an upcoming patch will lead to patterns using +0.0 as the
identity instead of -0.0. I'm adding support for these patterns
now to avoid any regressions for MVE.
Differential Revision: https://reviews.llvm.org/D127275
D125887 changed the ctlz/cttz despeculation transform to insert
a freeze for the introduced branch on zero. While this does fix
the "branch on poison" issue, we may still get in trouble if we
pick a different value for the branch and for the ctz argument
(i.e. non-zero for the branch, but zero for the ctz). To avoid
this, we should use the same frozen value in both positions.
This does cause a regression in RISCV codegen by introducing an
additional sext. The DAG looks like this:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %3
t4: i64 = AssertSext t2, ValueType:ch:i32
t23: i64 = freeze t4
t9: ch = CopyToReg t0, Register:i64 %0, t23
t16: ch = CopyToReg t0, Register:i64 %4, Constant:i64<32>
t18: ch = TokenFactor t9, t16
t25: i64 = sign_extend_inreg t23, ValueType:ch:i32
t24: i64 = setcc t25, Constant:i64<0>, seteq:ch
t28: i64 = and t24, Constant:i64<1>
t19: ch = brcond t18, t28, BasicBlock:ch<cond.end 0x8311f68>
t21: ch = br t19, BasicBlock:ch<cond.false 0x8311e80>
I don't see a really obvious way to improve this, as we can't push
the freeze past the AssertSext (which may produce poison).
Differential Revision: https://reviews.llvm.org/D126638
Add new intrinsic and codegen support for the s_sendmsg_rtn_b32 and
s_sendmsg_rtn_b64 instructions.
Differential Revision: https://reviews.llvm.org/D127315
In GFX10 dlc controlled L1 cache bypass. In GFX11 it has been repurposed
to control MALL NOALLOC, and glc controls L1 as well as L0 cache bypass.
Update the documentation and SIMemoryLegalizer accordingly. Set dlc for
nontemporal and volatile accesses.
Differential Revision: https://reviews.llvm.org/D127405
The patch is a replacement of D125199. PseudoReadVL with vtype has worry for
computing same vtypes of VLEFF/VLSEGFF in two different places, DAGToDAG and
InsertVSETVLI. VLEFF/VLSEGFF MI with VL output still could provide the vtype of
VLEFF/VLSEGFF to the users of its VL.
The patch names the new pseudo as original VLEFF/VLSEGFF name suffixed "_VL" and
expand them in RISCVInsertVSETVLI pass.
This patch also reverts commit 4537aae0d57e17c217c192d8977012ba475b130c,
"[RISCV] Make PseudoReadVL have the vtypes of the corresponding VLEFF/VLSEGFF.".
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D126794
For an addition with simm14 and simm15 immediates with 2 or 3 trailing bits,
we can use a shXadd instruction and an addi to do the addition.
This patch teaches RISCVMergeBaseOffset to see through this pattern.
I don't think the sh1add case occurs because we use two addis for that,
but I implemented it for completeness.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127376
Changes for GFX11:
- Clauses may not mix instructions of different types, and there are
more types. For example image instructions with and without a sampler
are now different types.
- The max size of a clause is explicitly documented as 63 instructions.
Previously it was implicitly assumed to be 64. This is such a tiny
difference that it does not seem worth making it conditional on the
subtarget.
- It can be beneficial to clause stores as well as loads.
Differential Revision: https://reviews.llvm.org/D127391
Nic Curtis done the experiments to prove it is faster than a
separate mul and add.
Fixes: SWDEV-332806
Differential Revision: https://reviews.llvm.org/D127253
- VOP3 and SDWA forms of V_CMPX were not handled
- Hazard only exists if the compare defines EXEC (i.e. V_CMPX)
forwarded to the permlane.
Differential Revision: https://reviews.llvm.org/D127344
This was assuming the vector types were MVTs, but they don't have to be.
Note that the concrete output of the test isn't very useful, since it's
dominated by nonsensical calling convention lowering for the weird types.
Differential Revision: https://reviews.llvm.org/D126505
In order to make sure the stack point is right through the EH region,
we also need to restore stack pointer from the frame pointer if we
don't preserve stack space within prologue/epilogue for outgoing variables,
normally it's just checking the variable sized object is present or not
is enough, but we also don't preserve that at prologue/epilogue when
have vector objects in stack.
Example to show what happened:
```
try {
sp adjust for outgoing args. // 1. Sp changed.
func_call // 2. Exception raised
sp restore // Oh, not restored
} catch {
// 3. And now we are here.
}
// 4. Prepare to return!, restore return address from stack, but...sp is wrong.
// 5. Screw up!
```
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D126861
This should fix a number of shuffle regressions in D127115 where the re-ordered combines mean we fail to fold a EXTRACT_VECTOR_ELT/INSERT_VECTOR_ELT sequence into a BUILD_VECTOR if we extract from more than one vector source.
The generic legalizer framework is still used to reduce the problem
to scalar multiplication with the bit size a multiple of 32.
Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:
- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
factors and add a 64-bit accumulator. Most partial products should
use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
product multiply-adds can feed the accumulator into each other
directly. (The register allocator's support for that is somewhat
limited, but that only matters for 128-bit integers and larger.)
- OTOH, on some hardware, V_MAD_U64_U32 requires the accumulator
to be stored in an even-aligned pair of GPRs. To avoid excessive
register copies, it makes sense to compute odd partial products
separately from even partial products (where a partial product
src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
halves together as a final step.
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
combining carries using G_UADDE is preferable over G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
the V_MAD_I64_I32 instruction (signed version!) can be used.
It is difficult to address these points generically:
1) Finding matching pairs of G_MUL and G_UMULH to find a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.
2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.
3) How could the generic legalizer possibly deal with the
particular idiosyncracy of "odd" vs. "even" partial products.
All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.
Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.
This patch leaves three gaps open:
1. When the factors are uniform, we should execute the multiplication on
the SALU. Register bank mapping already ensures this.
However, the resulting code sequence is not optimal because it doesn't
fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
doesn't have a carry-in.) It is very difficult to fix this after the
fact, so we should really use a different legalization sequence in
this case. Unfortunately, we don't have a divergence analysis and so
cannot make that choice.
(This only matters for 128-bit integers and larger.)
2. Avoid unnecessary multiplies when sources are known to be zero- or
sign-extended. The challenge is that the legalizer does not currently
have access to GISelKnownBits.
3. When the G_MUL is followed by a G_ADD, we should consider combining
the two instructions into a single multiply-add sequence, to utilize
the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
multiple uses and the implied duplication of the multiply is an
overall negative). However, this is also not true when the factors
are uniform: in that case, it is generally better to *not* combine
the two operations, so that the multiply can be done on the SALU.
Again, we don't have a divergence analysis available and so cannot
make an informed choice.
Differential Revision: https://reviews.llvm.org/D124844
This matches what we do in IR. For the RISC-V test case, this allows
us to use -8 for the AND mask instead of materializing a constant in a register.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D127335
During lowering of memcmp/bcmp, the check for a size of 0 is done
in 2 different ways. In rare cases this can lead to a crash in
SystemZSelectionDAGInfo::EmitTargetCodeForMemcmp(). The root cause
is that SelectionDAGBuilder::visitMemCmpBCmpCall() checks for a
constant int value which is not yet evaluated. When the value is
turned into a SDValue, then the evaluation is done and results in
a ConstantSDNode. But EmitTargetCodeForMemcmp() expects the special
case of 0 length to be handled, which results in an assertion.
The fix is to turn the value into a SDValue, so that both functions
use the same check.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D126900
The ToBeRemoved is used to remove any MachineInstructions that are no
longer needed, making sure we don't invalidate the iterator that is
currently in use by erasing the instruction straight away. This makes
issues for keeping the code in SSA from though, where subsequent
transforms that require SSA form may have been broken by previous
peepholes.
If, instead, we use make_early_inc_range the iteration issue shouldn't
be present, so long as we do not remove the subsequent instruction in
the peephole optimizations. That way the code between transforms is kept
in SSA form, meaning hopefully less things that can go wrong.
Differential Revision: https://reviews.llvm.org/D127296
Add with immediates in the range [-4096, -2049] or [2048, 4095] get
convert to two ADDIs. Teach RISCVMergeBaseOffset to recognize this
pattern as well.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126843
LUI+ADDIW always produces a simm32. This allows us to always
fold it into a global offset.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D126729
When e.g. a VR64 register is spilled to a stack slot requiring a long
(20-bit) displacement, it is possible to use an FP opcode if the allocated
phys reg allows it. This eliminates the use of a separate LAY instruction.
Reviewed By: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D115406
Based on D24038.
LLVM has an @llvm.eh.dwarf.cfa intrinsic, used to lower the GCC-compatible __builtin_dwarf_cfa() builtin.
Reviewed By: StephenFan
Differential Revision: https://reviews.llvm.org/D126181
Extend the TypeWidenVector case of PromoteIntRes_BITCAST to work
with TypeSize directly rather than silently casting to unsigned.
To accomplish this I've extended TypeSize with an interface that
essentially allows TypeSize division when both operands have the
same number of dimensions.
There still exists combinations of scalable vector bitcasts that
cause compiler crashes. I call these out by adding "is missing"
entries to sve-bitcast.
Depends on D126957.
Fixes: #55114
Differential Revision: https://reviews.llvm.org/D127126
Bitcasting between unpacked scalable vector types of different
element counts is not a NOP because the live elements are laid out
differently.
01234567
e.g. nxv2i32 = XX??XX??
nxv4f16 = X?X?X?X?
Differential Revision: https://reviews.llvm.org/D126957
Check supportsTypedPointers instead of hasSetOpaquePointersValue when query if has typed ptr.
Reviewed By: beanz
Differential Revision: https://reviews.llvm.org/D127268
Spliter will try to extend a live range into `r` slot for a use operand,
that's works on most situaion, however that not work correctly when the operand
has tied to def, and the def operand is early clobber.
Give an example to demo what's wrong:
0 %0 = ...
16 early-clobber %0 = Op %0 (tied-def 0), ...
32 ... = Op %0
Before extend:
%0 = [0r, 0d) [16e, 32d)
The point we want to extend is 0d to 16e not 16r in this case, but if
we use 16r here we will extend nothing because that already contained
in [16e, 32d).
This patch add check for detect such case and adjust the extend point.
Detailed explanation for testcase: https://reviews.llvm.org/D126047
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D126048
isDef32 would attempt to make a guess at which SelectionDag nodes were
32bit sources, and use the nature of 32bit AArch64 instructions
implicitly zeroing the upper register half to not emit zext that were
expected to already be zero. This was a bit fragile though, needing to
guess at the correct opcodes that do not become 32bit defs later in
ISel.
This patch removed isDef32, relying on the AArch64MIPeephole optimizer
to remove redundant SUBREG_TO_REG nodes. A part of
SelectArithExtendedRegister was left with the same logic as a heuristic
to prevent some regressions from it picking less optimal sequences.
The AArch64MIPeepholeOpt pass also needs to be taught that a COPY from a
FPR will become a FMOVSWr, which it lowers immediately to make sure that
remains true through register allocation.
Fixes#55833
Differential Revision: https://reviews.llvm.org/D127154
As noticed on D127115 - we were missing this fold, instead just having the shuffle(shuffle(x,undef,splatmask),undef) fold. We should be able to merge these into one using SelectionDAG::isSplatValue, but we'll need to match the shuffle's undef handling first.
This also exposed an issue in SelectionDAG::isSplatValue which was incorrectly propagating the undef mask across a bitcast (it was trying to just bail with a APInt::isSubsetOf if it found any undefs but that was actually the wrong way around so didn't fire for partial undef cases).
i64 indices aren't supported on Zve32*. Scalarize gathers to prevent
generating illegal instructions.
Since InstCombine will aggressively canonicalize GEP indices to
pointer size, we're pretty much always going to have an i64 index.
Trying to predict when SelectionDAG will find a smaller index from
the TTI hook used by the ScalarizeMaskedMemIntrinPass seems fragile.
To optimize this we probably need an IR pass to rewrite it earlier.
Test RUN lines have also been added to make sure the strided load/store
optimization still works.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127179
Use the query that doesn't assert if TracksLiveness isn't set, which
needs to always be available. We also need to start printing liveins
regardless of TracksLiveness.
First step towards enabling shuffle combining starting from VSELECT/BLENDV nodes - this should eventually help improve the codegen reported at Issue #54819