Shows failure to fold zero_extend_vector_inreg(and(x, c)) -> bitcast(and(x,c')) when we're only demanding the 0'th extended element, such as with the SSE variable shift ops.
There are multiple possible ways to represent the X - urem X, Y pattern. SCEV was not canonicalizing, and thus, depending on which you analyzed, you could get different results. The sub representation appears to produce strictly inferior results in practice, so I decided to canonicalize to the Y * X/Y version.
The motivation here is that runtime unroll produces the sub X - (and X, Y-1) pattern when Y is a power of two. SCEV is thus unable to recognize that an unrolled loop exits because we don't figure out that the new unrolled step evenly divides the trip count of the unrolled loop. After instcombine runs, we convert the the andn form which SCEV recognizes, so essentially, this is just fixing a nasty pass ordering dependency.
The ARM loop hardware interaction in the test diff is opague to me, but the comments in the review from others knowledge of the infrastructure appear to indicate these are improvements in loop recognition, not regressions.
Differential Revision: https://reviews.llvm.org/D114018
This patch adds PPC back end optimization to analyze the arguments of a
conditional trap instruction to execute one of the following:
1. Delete it if never trap
2. Replace it if always trap
3. Otherwise keep it
Reviewed By: nemanjai, amyk, PowerPC
Differential revision: https://reviews.llvm.org/D111434
For now I've just changed the code to only return true from
AArch64ISelLowering::hasAndNot if the vector is fixed-length.
Once we have the right patterns or DAG combines to use bic/bif
we can also enable this for SVE.
Test added here:
CodeGen/AArch64/vselect-constants.ll
Differential Revision: https://reviews.llvm.org/D113994
Global ctor/dtor can be an empty array, which is a Constant not a
ConstantArray. The cast<ConstantArray> therefore asserts / crashes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D113800
STATEPOINT instruction behavior is similar to call instruction.
In aarch64 BL instruction implicitly define lr register, so
STATEPOINT instruction should do the same.
However STATEPOINT is a general pseudo instruction and I could not find
a way to override list of implicit defs for specific target.
So this patch post processes inserting STATEPOINT instruction by
adding implisit dead def for lr.
Reviewers: reames, loicottet, ostannard
Reviewed By: reames
Subscribers: danilaml, hiraditya, kristof.beyls, llvm-commits, yrouban
Differential Revision: https://reviews.llvm.org/D111114
This is a first attempt at a constant value consecutive store merging pass,
a counterpart to the DAGCombiner's store merging optimization.
The high level goals of this pass:
* Have a simple and efficient algorithm. As close to linear time as we can get.
Thus, prioritizing scalability of the algorithm over merging every corner case
we can find. The DAGCombiner's store merging code has been the source of
compile time and complexity issues in the past and I wanted to avoid that.
* Don't introduce any new data structures for ordering memory operations. In MIR,
we don't have the concept of chains like we do in the DAG, and the instruction
order is stricter than enforcing ordering with graph edges. Although I
considered adding something similar, I couldn't justify the overhead.
The pass is current split into 3 main parts. The main store merging code focuses
on identifying candidate stores and managing the candidate group that's under
consideration for merging. Analyzing addressing of stores is a potentially
complex part and for now there's just a basic implementation to identify easy
cases. Finally, the other main bit of complexity is the alias analysis, which
tries to follow the same logic as the DAG's AA.
Currently this implementation only supports merging of constant stores. Stores
of arbitrary variables are technically possible with a very small change, but
the DAG chooses not to do this. Doing so here makes most code worse since
there's extra overhead in merging values into wider registers.
On AArch64 -Os, this optimization results in very minor savings on CTMark.
Differential Revision: https://reviews.llvm.org/D109131
These test files are copied directly from AArch64. Some of the cases
may benefit from ANDN with the Zbb extension. Somes cases already
improve use ANDN.
selectcc-to-shiftand.ll also contains tests that test select->and
conversion even when a ANDN isn't needed. I think this improves our
coverage of these optimizations.
Differential Revision: https://reviews.llvm.org/D113935
Fixes PR#48678. `X86TargetLowering::getRegForInlineAsmConstraint()` can adjust the register class to match the type, e.g. change `VR128X` to `VR256X` if the type needs 256 bits. However, the function currently returns the unadjusted register and the adjusted register class, e.g. `xmm15` and `VR256X`, which then causes an assertion failure later because the register class does not contain that register. This patch fixes this behavior.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D113834
A future change will add SCC liveness checks. Since we are still
relying on forward register scavenging, add dead flags to avoid
spuriously detecting SCC as live.
It's possible that the mask is already a NOT. At least if InstCombine
hasn't canonicalized the input. In that case we will form an ANDN with
X instead of with Y. So we don't need to worry about Y being a constant.
We might need to check that X isn't a constant instead, but we don't
have a test case for that yet.
This fixes a size regression found when trying to enable this combine
for RISCV in D113937.
Differential Revision: https://reviews.llvm.org/D113948
There's really no reason why anyone should use these special names in a variant.
I noticed this while reading the code: all other writes to OS are guarded by
this conditional, and the behavior with the check seems more correct, so
let's add the check.
Differential Revision: https://reviews.llvm.org/D113909
The platform independent ISD::INSERT_VECTOR_ELT take a element index,
but vins* instructions take a byte index. Update 32bit td patterns for
vector insert to handle the element index accordingly.
Since vector insert for non constant index are supported in
ISA3.1, there is no need to use platform specific ISD node,
PPCISD::VECINSERT. Update td pattern to directly use
ISD::INSERT_VECTOR_ELT instead.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D113802
This handles the case where the mask register instruction input
comes from a Phi of vsetvlis. If the VLMAX is the same as the VLMAX
required by the mask register instruction, we can avoid a vsetvli.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113204
The first try at this patch ( bf5748a1af0d ) was reverted ( 5be64d416481 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.
This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.
A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.
Original commit message:
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
Differential Revision: https://reviews.llvm.org/D113603
If an operand is a bitcasted or widended constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.
I've refactored getAVX512Node so that VLX targets can make better use of this as well.
NOTE: In the future, I think we should consider removing the broadcast of constant data from DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets has similar problems with missed (whole vector) folds that need to be improved as well.
Differential Revision: https://reviews.llvm.org/D113845
This casued assertion failures:
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
&& "Cannot use this version of ReplaceAllUsesWith!"' failed.
See comment on the code review.
(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
manually due to other changes having landed after the reverted one.)
> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603
This reverts commit bf5748a1af0d2f6f9396d9dc6ac89d15de41eee7.
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.
It also optionally packs the upper half instead of the lower half.
On SPARC, S/UMULO operation on 64-bit integers works by extending the value to 128-bit, then doing a multiplication and checking the upper half of the result.
This makes UMULO works correctly by putting a zero in the upper half rather than doing a sign extension.
Reviewed By: LemonBoy
Differential Revision: https://reviews.llvm.org/D110555
This was noted as a follow-up to D113212 / D113426:
4fc1fc4005f7
7e30404c3b6c
11522cfcad6b
https://alive2.llvm.org/ce/z/e4o96b
The canonicalization rules for these IR patterns are complicated,
and we were not matching the expected forms in 2 out of the 3
cases. We can make codegen more robust by matching the swapped
forms (and that will also work if these patterns are created late).
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.
Fixes regressions in D113192
Differential Revision: https://reviews.llvm.org/D113827
This modifies the preconditions of TypePromotion's isSafeWrap
method, to allow it to work from all constants from the ICmp.
Using the code:
%a = add %x, C1
%c = icmp ult %a, C2
According to Alive, we can prove that is equivalent to
icmp ult (add zext(%x), sext(C1)), zext(C2) given
C1 <=s 0 and C1 >s C2.
https://alive2.llvm.org/ce/z/CECYZB
Which is similar to what is already present. We can also
prove icmp ult (add zext(%x), sext(C1)), sext(C2) given
C1 <=s 0 and C1 <=s C2.
https://alive2.llvm.org/ce/z/KKgyeL
The PrepareWrappingAdds method was removed, and the
constants are now altered to sext or zext directly as
required by the above methods.
Differential Revision: https://reviews.llvm.org/D113678
The division by constant optimization often produces constants that
are uimm32, but not simm32. These constants require 3 or 4 instructions
to materialize without Zba.
Since these instructions are often used by a multiply with a LHS
that needs to be zero extended with an AND, we can switch the MUL
to a MULHU by shifting both inputs left by 32. Once we shift the
constant left, the upper 32 bits no longer need to be 0 so constant
materialization is free to use LUI+ADDIW. This reduces the constant
materialization from 4 instructions to 3 in some cases while also
reducing the zero extend of the LHS from 2 shifts to 1.
Differential Revision: https://reviews.llvm.org/D113805
These patterns were noted in the recent D113212 and follow-ups.
I did not bother to duplicate every test because it should be
clear if we recognize the swaps from a smaller sample. We have
complete coverage for the original patterns.
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.
We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.
Differential Revision: https://reviews.llvm.org/D113679
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.
This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.
This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.
Differential Revision: https://reviews.llvm.org/D113748
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
Differential Revision: https://reviews.llvm.org/D113603
Register uses that are MRI->isConstantPhysReg() should not inhibit
sinking transformation.
Reviewed By: StephenTozer
Differential Revision: https://reviews.llvm.org/D111531
When compiler converts x87 operations to stack model, it may insert
instructions that pop top stack element. To do it the compiler inserts
instruction FSTP right after the instruction that calculates value on
the stack. It can break the code that uses FPSW set by the last
instruction. For example, an instruction FXAM is usually followed by
FNSTSW, but FSTP is inserted after FXAM. As FSTP leaves condition code
in FPSW undefined, the compiler produces incorrect code.
With this change FSTP in inserted after the FPSW consumer if the last
instruction sets FPSW.
Differential Revision: https://reviews.llvm.org/D113335
This fixes the crash due to lacking VZEXT_MOVL support with i16.
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D113661
This improves our coverage of soft float libcalls lowering.
Remove most of the test cases from rv64i-single-softfloat.ll. They
were duplicated in the test files that now test softflow. Only
a couple test cases for constrained FP remain. Those should be
removed when we start supporting constrained FP.
This is follow up from D113528.
Many of these had an extra 'f' at the beginning of their name that
caused them to not be treated as intrinsics.
I'm not sure what fpround was supposed to be so I deleted it.
frem was changed from an intrinsic to an instruction.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D113528