Calls to `@llvm.abs(undef, i1 true)` and `@llvm.abs(INT_MIN, i1 true)`
can be optimized to `poison` instead of `undef`.
[Alive2](https://alive2.llvm.org/ce/z/Hg-2ug)
Similar to what we do for broadcast shuffles, when legalising load costs, if the value is known to be uniform, then we will only load a single vector and reuse this across the split legalised registers.
Fixes#111126
ConstraintElimination currently only supports decomposing gep nusw with
non-negative indices (with "non-negative" possibly being enforced via
pre-condition).
Add support for gep nuw, which directly gives us the necessary
guarantees for the decomposition.
ValueMap automatically updates entries with the new value if they have
been RAUW. This can lead to instructions that are expected to not have
shape info to be added to the map (e.g. shufflevector as in the added
test case).
This leads to incorrect results. Originally it was used for transpose
optimizations, but they now all use updateShapeAndReplaceAllUsesWith,
which takes care of updating the shape info as needed.
This fixes a crash in the newly added test cases.
PR: https://github.com/llvm/llvm-project/pull/118282
This consists of:
* Make these instructions part of FPMathOperator.
* Adjust bitcode/ir readers/writers to expect fast math flags on these
instructions.
* Make IRBuilder set the fast math flags on these instructions.
* Update langref and release notes.
* Update a bunch of tests. Some of these are due to InstCombineCasts
incorrectly adding fast math flags to fptrunc, which will be fixed in a
later patch.
This patch extends the PGO infrastructure with an option to prefer the
instrumentation of loop entry blocks.
This option is a generalization of
19fb5b467b,
and helps to cover cases where the loop exit is never executed.
An example where this can occur are event handling loops.
Note that change does NOT change the default behavior.
This PR removes tests with `br i1 undef` under
`llvm/tests/Transforms/ObjCARC, Reassociate, SCCP, SLPVectorizer...`.
After this PR, I'll continue to fix tests under `llvm/tests/CodeGen`,
which has more UB tests than `llvm/tests/Transforms`.
This PR adds more realistic cost estimates for these reduction
intrinsics
- `llvm.vector.reduce.umax`
- `llvm.vector.reduce.umin`
- `llvm.vector.reduce.smax`
- `llvm.vector.reduce.smin`
- `llvm.vector.reduce.fadd`
- `llvm.vector.reduce.fmul`
- `llvm.vector.reduce.fmax`
- `llvm.vector.reduce.fmin`
- `llvm.vector.reduce.fmaximum`
- `llvm.vector.reduce.fminimum`
- `llvm.vector.reduce.mul
`
The pre-existing cost estimates for `llvm.vector.reduce.add` are moved
to `getArithmeticReductionCosts` to reduce complexity in
`getVectorIntrinsicInstrCost` and enable other passes, like the SLP
vectorizer, to benefit from these updated calculations.
These are not expected to provide noticable performance improvements and
are rather provided for the sake of completeness and correctness. This
PR is in draft mode pending benchmark confirmation of this.
This also provides and/or updates cost tests for all of these
intrinsics.
This PR was co-authored by me and @JonPsson1 .
Check for nusw instead of inbounds when decomposing GEPs.
In this particular case, we can also look through multiple nusw
flags, because we will ultimately be working in the unsigned
constraint system.
Unsigned icmp of gep nuw folds to unsigned icmp of offsets. Unsigned
icmp of gep nusw nuw folds to unsigned samesign icmp of offsets.
Proofs: https://alive2.llvm.org/ce/z/VEwQY8
* Adds tests for strided accesses.
* Adds tests for reverse loops.
As part of this I've moved one of the negative tests from
load-deref-pred-align.ll into a new file
(load-deref-pred-neg-off.ll) because the pointer type had a
size of 16 bits and I realised it's probably not sensible for
allocas that are >16 bits in size!
Bail out when evaluating an insertelement of a constant expression.
Unlike other ValueLattice kinds, these don't have implicit splat
semantics and we end up with type mismatches. If we actually wanted
to handle these, we should actually evaluate the insertion via
constant folding. I'm not bothering with that, as these should
get constant folded on construction already.
InstSimplify currently folds patterns like `(x | y) uge x` and `(x & y)
ule x` to true. However, it cannot handle combinations of such
situations, such as `(x | y) uge (x & z)` etc.
To support this, recursively collect operands of monotonic instructions
(that preserve either a greater-or-equal or less-or-equal relationship)
and then check whether any of them match.
Fixes https://github.com/llvm/llvm-project/issues/69333.
We had a separate fold that handled just the trivial case where we're
replacing exactly the argument of the select. Handle this in select
value equivalence by relaxing the infinite loop protection to allow a
replacement of a non-constant with a constant.
This also fixes https://github.com/llvm/llvm-project/issues/113301, as
the separate fold did not handle undef values correctly.
SK_PermuteTwoSrc legalization has to assume any of the legalised source registers could be referenced in split shuffles, but if we already know that each 128-bit lane only references elements from the same lane of the source operands, then this scaling won't occur.
Hopefully this can help with #113356 without us having to get full processShuffleMasks canonicalization finished first.
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.
getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().
SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
When the size is an appropriate constant, __memcpy_chk will turn into a
memcpy that gets folded away by InstCombine. Therefore this patch avoids
counting these as calls for purposes of inlining costs.
This is only really relevant on platforms whose headers redirect memcpy
to __memcpy_chk (such as Darwin). On platforms that use intrinsics,
memcpy and similar functions are already exempt from call penalties.
Most AArch64 cpus outside of Neoverse V1 (256) and A64FX (512) have an
SVE vector length of 128, and in environments like Android (where no
mcpu option is common) we would expect all cpus to match. This patch
changes the default vector length to 128 with -mcpu=generic, to match
the most common case.
This test checks the costs, not vectorization, so is better placed in the
existing gather/scatter cost modelling tests. An extra neoverse-v2 check line
has been added for both gathers and scatters.
As AliasAnalysis now has support for scalable sizes, add tests to
DeadStoreElimination covering the scalable vectors case, in preparation
to extend it.
lowerDotProduct called above may already lower a matrix multiply and
mark it as procssed by adding it to FusedInsts. Don't try to process it
again in LowerMatrixMultiplyFused by checking if FusedInsts.
Without this change, we trigger an assertion when trying to erase the
same original matrix multiply twice.