This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
No need to special case add 0, N. SelectionDAG::getNode contains the canonicalization and simplification for this case, so no need to duplicate it here.
A splat index means the operation is reading from (writing to) the same
memory location. Generally, zero is the cheapest value to splat. As
such, we'd prefer to add the splatted value to the base, and use a
constant zero as the index operand.
By default, `DAGCombiner` folds `sext x` to `zext x` when `x` is
non-negative. It will generate redundant `zext` inst seq on riscv64
(typically `slli (srli x, 32), 32`).
godbolt: https://godbolt.org/z/osf6adP1o
This patch applies the transform iff `zext` is **cheaper** than `sext`.
Assuming the ADD is nsw then it may be sign-extended to merge with a SHL op in a similar fold to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.
This is most useful for helping to expose address math for X86, but has also touched several aarch64 test cases as well.
Alive2: https://alive2.llvm.org/ce/z/2UpSbJ
Differential Revision: https://reviews.llvm.org/D159198
Assuming the ADD is nsw then it may be sign-extended to merge with a SHL op in a similar fold to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.
This is most useful for helping to expose address math for X86, but has also touched several aarch64 test cases as well.
Alive2: https://alive2.llvm.org/ce/z/2UpSbJ
Differential Revision: https://reviews.llvm.org/D159198
In some cases where the same mask is used for multiple
extending masked loads it can be more efficient to combine
the zero- or sign-extend into the load even if it's not a
legal or custom operation. This leads to splitting up the
extending load into smaller parts, which also requires
splitting the mask. For SVE at least this improves the
performance of the SPEC benchmark x264 slightly on
neoverse-v1 (~0.3%), and at least one other benchmark
improves by around 30%. The uplift for SVE seems due to
removing the dependencies (vector unpacks) introduced
between the loads and the vector operations, since this
should increase the level of parallelism.
See tests:
CodeGen/AArch64/sve-masked-ldst-sext.ll
CodeGen/AArch64/sve-masked-ldst-zext.ll
https://reviews.llvm.org/D159191
From the discussion in https://reviews.llvm.org/D158853, moving the truncate
into the splat helps more splatted scalar operands get selected on RISC-V, and
also avoids the need for splat_vector_parts on RV32.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D159147
CMP(A,C)||CMP(B,C) => CMP(MIN/MAX(A,B), C)
CMP(A,C)&&CMP(B,C) => CMP(MIN/MAX(A,B), C)
If the operands are proven to be non NaN, then the optimization can be applied
for all predicates.
We can apply the optimization for the following predicates for FMINNUM/FMAXNUM
(for quiet and signaling NaNs) and for FMINNUM_IEEE/FMAXNUM_IEEE if we can prove
that the operands are not signaling NaNs.
- ordered lt/le and ||
- ordered gt/ge and ||
- unordered lt/le and &&
- unordered gt/ge and &&
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D155267
This removes some diffs created by D153502.
I'm assuming an AND/OR won't be worse than an SMIN/SMAX. For
RISC-V at least, AND/OR can be a shorter encoding than SMIN/SMAX.
It's weird that we have two different functions responsible for
folding logic of setccs, but I'm not ready to try to untangle that.
I'm unclear if the PowerPC chang is a regression or not. It looks
like it might use more registers, but I don't understand PowerPC
register so I'm not sure.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D158292
After recent patch D30189, #64323's error message become a new one.
When DAGCombiner was optimizing `(vextract (scalar_to_vector val, 0) -> val`, it didn't
consider the possibility that the inserted value type has less bit than the dest type.
This patch fixes that.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D158355
D152276 wasn't handling the case where the inserted element is implicitly truncated into the vector - resulting in a i1 element (implicitly truncated from i8) overwriting 8 bits instead of 1 bit.
This patch is intended to be merged into 17.x so I've just disallowed any vector element vs inserted element type mismatch - technically we could be more elegant and permit truncated stores (as long as the store is still byte sized), but the use cases for that are so limited I'd prefer to play it safe for now.
Candidate patch for #64655 17.x merge
Differential Revision: https://reviews.llvm.org/D158366
Targets may lose some optimization opportunities for certain vector operation
if we reduce BUILD_VECTOR to BITCAST early.
And if VT is not legal, reduce BUILD_VECTOR to BITCAST before LegailizeTypes
can get benefit. Because type-legalizer often scalarizes illegal type of vectors.
Reviewed By: sebastian-ne
Differential Revision: https://reviews.llvm.org/D156645
On the extract_subvector side, we already have the restriction. With D158201, we'd start getting unprofitable splat combines unless we add the same one on the extract_subvector side.
Differential Revision: https://reviews.llvm.org/D158202
We have an existing DAG combine for when an insert/extract subvector pair is entirely a nop, but we hadn't handled the case where the net result was either an insert or an extract (but not both). The transform is restricted to index = 0 to avoid having to adjust indices after the transform.
Differential Revision: https://reviews.llvm.org/D158201
We have an existing DAG combine for when an insert/extract subvector pair is entirely a nop, but we hadn't handled the case where the net result was either an insert or an extract (but not both). The transform is restricted to index = 0 to avoid having to adjust indices after the transform.
Reviews, a couple comments on the test changes:
* Mostly RISCV, mostly schedule reordering.
* One real regression in splats-with-mixed-vl.ll due to a different overly aggressive combine, fix in a follow up patch.
* The test/CodeGen/X86/vector-replicaton-i1-mask.ll diff looked concerning at first, but not the mask size at most 4 i1s. I think the type changes on the mask loads are correct, but would welcome a second opinion with someone more familiar with AVX512 codegen.
Differential Revision: https://reviews.llvm.org/D158201
RISC-V found a case where the CombineTo caused N to be CSEd with
an existing node and then deleted. The top level DAGCombiner loop
was surprised to find a node was deleted, but SDValue() was returned
from the visit function.
We need to return SDValue(N, 0) to tell the top level loop that
a change was made, but the worklist updates were already handled.
Fixes#64772.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D158208
Move the sra(shl(x,c1),c1) -> sign_extend_inreg(x) fold inside SimplifyDemandedBits so we can recognize hidden splats with DemandedElts masks.
Because the c1 shift amount has multiple uses, hidden splats won't get simplified to a splat constant buildvector - meaning the existing fold in DAGCombiner::visitSRA can't fire as it won't see a uniform shift amount.
I also needed to add TLI preferSextInRegOfTruncate hook to help keep truncate(sign_extend_inreg(x)) vector patterns on X86 so we can use PACKSS more efficiently.
Differential Revision: https://reviews.llvm.org/D157972
We already do this in getNode, but the undef might appear during
another DAGCombine.
While here remove code for handling noop truncates. getNode checks
the types and won't a noop truncate.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157910
Original commit didn't handle the case where one of the stores was a
truncating store of the build_vector. The existing codepath produced
wrong code (which thankfully also failed asserts) instead of guarding
against unexpected types. Original commit message follows..
Ran across this when making a change to RISCV memset lowering. Seems
very odd that manually merging a store into a vector prevents it from
being further merged.
Differential Revision: https://reviews.llvm.org/D156349
This happens when CMP1 and CMP3 have the same predicate (or CMP2 and CMP3 have
the same predicate).
This helps optimizations such as the fololowing one:
CMP(A,C)||CMP(B,C) => CMP(MIN/MAX(A,B), C)
CMP(A,C)&&CMP(B,C) => CMP(MIN/MAX(A,B), C)
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156215
Idx's type can be different from Ptr's, causing a "Binary operator types must match" assertion failure when emitting the MUL.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156972
Ran across this when making a change to RISCV memset lowering. Seems very odd that manually merging a store into a vector prevents it from being further merged.
Differential Revision: https://reviews.llvm.org/D156349
hasPredecessorHelper method, that is used by DAGCombiner to combine to pre-indexed and post-indexed load/stores, is a major source of slowdown while compiling a large function with MSan enabled on Arm. This patch caps the DFS-graph traversal for this method to 8192 which cuts compile time by 50% (4m -> 2m compile time) at the cost of less overall nodes combined.
Here's the summary of pre-index DAG nodes created and time it took to compile the pathological case with different MaxDepth limit:
1. With MaxDepth = 0 (unlimited): 1800, took 4m
2. With MaxDepth = 32k, 560, took 2m31s
3. With MaxDepth = 8k, 139, took 2m.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D154885
During the construction of SelectionDAG, there are no explicit canonicalization rules to adjust the order of operands for AND nodes. This may prevent the optimization in DAGCombiner::visitANDLike from being triggered. This patch canonicalizes the operands before matches, which can be observed to improve optimization on the RISC-V target architecture.
Canonicalize:
```
and(x, add) -> and(add, x)
```
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154760
The gain is usually suffiscient to go the extra mile and reconstruct a carry in some cases.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D154533
We currently don't extract vector elements from multi-use build vectors unless TLI.aggressivelyPreferBuildVectorSources accepts them, which seems a little extreme for constant build vectors (especially as under some cases ComputeKnownBits will indirectly extract the data for us).
This is causing a few regressions in some upcoming SimplifyDemandedBits work I'm looking at, all of which just need to know that the element is zero, so I've tweaked the fold to accept zero elements as well, which will typically fold very easily.
Differential Revision: https://reviews.llvm.org/D155582
CMP(A,C)||CMP(B,C) => CMP(MIN/MAX(A,B), C)
CMP(A,C)&&CMP(B,C) => CMP(MIN/MAX(A,B), C)
This first patch handles integer types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D153502
Inspired by some of the cases from D145468
Let SimplifyDemandedBits handle the narrowing of lshr to half-width if we don't require the upper bits, the narrowed shift is profitable and the zext/trunc are free.
A future patch will propose the equivalent shl narrowing combine.
Differential Revision: https://reviews.llvm.org/D146121