In the reordered RHS path of matchesShlZExt, the code never checked that
each shift amount (0, Stride, 2×Stride, …) appears at most once. When
the same shift appeared in multiple lanes, it still filled Order,
producing a non-permutation (e.g. Order = [0,0,0,1]). That led to bad
shuffle masks and miscompilation (e.g. shuffles with poison).
The patch adds an explicit duplicate check: before setting Order[Idx] =
Pos, it ensures Pos has not been seen before, using a SmallBitVector
SeenPositions(VF). If a position is seen twice, the function returns
false and the optimization is not applied.
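The duplicate check can be sketched outside of LLVM as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `buildOrder` is a hypothetical helper, `std::vector<bool>` stands in for `SmallBitVector`, and mapping each lane's shift to position `Shift / Stride` is a simplification of what matchesShlZExt does.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the reordered path: maps each lane's shift amount
// to a position and rejects the input unless the result is a permutation.
std::optional<std::vector<unsigned>>
buildOrder(const std::vector<unsigned> &Shifts, unsigned Stride) {
  assert(Stride != 0 && "stride must be non-zero");
  const unsigned VF = static_cast<unsigned>(Shifts.size());
  std::vector<unsigned> Order(VF, 0);
  std::vector<bool> SeenPositions(VF, false); // plays the role of SmallBitVector
  for (unsigned Idx = 0; Idx < VF; ++Idx) {
    if (Shifts[Idx] % Stride != 0)
      return std::nullopt; // shift is not a multiple of the stride
    const unsigned Pos = Shifts[Idx] / Stride;
    if (Pos >= VF || SeenPositions[Pos])
      return std::nullopt; // out of range or duplicate: not a permutation
    SeenPositions[Pos] = true;
    Order[Idx] = Pos;
  }
  return Order;
}
```

With shifts `{0, 0, 0, Stride}`, the case that previously produced Order = [0,0,0,1], this sketch returns `std::nullopt` on the second lane instead of filling Order.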
Converts reduced or(select %cmp, bitmask, 0) to zext(bitcast %vector_cmp to
i<num_reduced_values>) to iN.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/181940
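A scalar model of the or(select %cmp, bitmask, 0) conversion described above may help show why the two forms agree. This is only a sketch: the function names are hypothetical, the lane count of 8 is an assumption, and lane I is assumed to carry the bitmask `1 << I`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned NumLanes = 8; // assumed number of reduced values

// Original form: or-reduce select(cmp[I], 1 << I, 0) over all lanes.
uint32_t reduceOrSelect(const std::array<bool, NumLanes> &Cmp) {
  uint32_t Result = 0;
  for (unsigned I = 0; I < NumLanes; ++I)
    Result |= Cmp[I] ? (uint32_t{1} << I) : uint32_t{0};
  return Result;
}

// Transformed form: pack the compare results into an 8-bit integer (models
// the bitcast of the vector compare, lane 0 in the lowest bit), then widen
// it (models the zext).
uint32_t cmpBitmask(const std::array<bool, NumLanes> &Cmp) {
  uint8_t Packed = 0;
  for (unsigned I = 0; I < NumLanes; ++I)
    Packed = static_cast<uint8_t>(Packed | (static_cast<unsigned>(Cmp[I]) << I));
  return static_cast<uint32_t>(Packed);
}
```

Both functions return the same value for every input; the second one needs no per-lane select once the compare is available as a vector.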
Some zext i1 (cmp) + select sequences can be transformed by
inverting compare predicates to remove extra shuffles. For example,
zext i1 (cmp ne) + select (cmp eq), 0, 2 can be modeled as select <2
x > (cmp ne), <1, 2>, zeroinitializer
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/181580
If revec is enabled, the number of parts (registers) may be computed for
the combined node rather than for a single-element node, so the code
needs to check for a potential out-of-bounds access.
Fixes #181798
The patch changes the maximum tree size analysis:
1. Do not increase the depth for type-changing nodes (like casts and
compares), allowing deeper trees to be built.
2. Remove the NotProfitableForVectorization workaround, which is no
longer needed now that throttling is enabled.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/180950
Added basic estimations for external uses when calculating the cost
of non-profitable trees. Stores and insertelement instructions are
excluded, as they are very good candidates for vectorization. Also tuned
the buildvector/gather cost using minimum bitwidth analysis data.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/178024
If the gathered loads nodes are marked for deletion, they need to actually
be deleted from the tree. Also, if the remaining tree is too short
(buildvector + gather node), such trees need to be skipped to avoid hanging.
Fixes #180846
If the instructions are compatible but non-matching (a zext-select pair,
for example), there is no need to perform operand analysis; just report
that they are matching.
When vectorising calls to math intrinsics such as llvm.pow we
correctly detect and generate calls to the corresponding vector
math variant. However, we don't pick up and use the calling
convention for the vector math function. This matters for veclibs
such as ArmPL where the aarch64_vector_pcs calling convention
can improve codegen by reducing the number of registers that
need saving across calls.
Model zext i1 %x to iN as select i1 %x, iN 1, iN 0 if there are
other select instructions that can be combined with it into a bundle.
Fixes #178403
Recommit after revert in 993e1f66afcfe9da03bd813e669eada341b11d2f
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/180635
Model zext i1 %x to iN as select i1 %x, iN 1, iN 0 if there are
other select instructions that can be combined with it into a bundle.
Fixes #178403
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/180635
LoadCombine pattern handling was added as a workaround for the cases,
where the SLP vectorizer could not vectorize the code effectively. With
the copyables support, it can handle it directly.
Also, the patch adds support for the scalar loads[ + bswap] pattern for
byte-sized loads (+ reverse bytes for bswap).
Recommit after revert in 6377c86d718232fe60c548dfd7ab439f7ff84df7
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/174205
LoadCombine pattern handling was added as a workaround for the cases,
where the SLP vectorizer could not vectorize the code effectively. With
the copyables support, it can handle it directly.
Also, the patch adds support for the scalar loads[ + bswap] pattern for
byte-sized loads (+ reverse bytes for bswap).
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/174205
This patch restructures Find(First|Last)IV handling. Instead of
differentiating between FindLast, FindFirstIV and FindLastIV up front,
this patch simplifies the logic in IVDescriptor to just identify the
FindLast pattern up-front.
It then adds a new VPlan transformation to optimize FindLast reductions
to FindIV reductions if there is a suitable sentinel value.
It also merges the Find(Last|First)IV recurrence kinds into a single FindIV kind.
This is simpler and more accurate, given selecting the first/last
induction of the final IV reduction is directly controlled by the
corresponding recurrence kind of the ComputeReductionResult.
The new structure also allows further optimizations, like vectorizing
FindLastIV with another boolean reduction that tracks if the condition
in the loop was ever true, if there is no suitable sentinel value.
PR: https://github.com/llvm/llvm-project/pull/177870
Currently `RangeSizes` is used to allow us to skip trying to vectorize
clearly unprofitable trees by caching prior attempts `TreeSizes`. This
PR refactors that logic to simplify and improve readability. This will
make it easier to handle the strided stores.
Switches RangeSizes to use `first` as the location to look up values from and `second` as the location to store values to. `first` gets updated from `second` at the appropriate times to match the behavior prior to this change.
Before casting the value to an FP type, the code needs to check whether
the type was reduced during minbitwidth analysis and, if so, restore the
original source type to generate a correct bitcast operation.
Fixes #178884
If the reduction forms reversed bitcast, we can represent it as
a bitcast + bswap, if the source elements are byte sized
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/178513
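The reversed-bitcast observation above can be checked with a small scalar sketch. The helper names are hypothetical, and a 4-byte source widened to 32 bits is assumed purely for illustration; the forward pattern corresponds to a plain bitcast on a little-endian target.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Forward pattern: byte I lands at bit offset 8 * I.
uint32_t packForward(const std::array<uint8_t, 4> &B) {
  uint32_t V = 0;
  for (unsigned I = 0; I < 4; ++I)
    V |= static_cast<uint32_t>(B[I]) << (8 * I);
  return V;
}

// Reversed pattern: byte I lands at bit offset 8 * (3 - I).
uint32_t packReversed(const std::array<uint8_t, 4> &B) {
  uint32_t V = 0;
  for (unsigned I = 0; I < 4; ++I)
    V |= static_cast<uint32_t>(B[I]) << (8 * (3 - I));
  return V;
}

// Portable 32-bit byte swap, standing in for the bswap intrinsic.
uint32_t byteSwap32(uint32_t V) {
  return (V >> 24) | ((V >> 8) & 0x0000FF00u) | ((V << 8) & 0x00FF0000u) |
         (V << 24);
}
```

For any four bytes, `packReversed(B)` equals `byteSwap32(packForward(B))`, which is the bitcast + bswap representation in scalar form.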
Added support for reordered reductions of shl(zext)-like constructs. Such
constructs are currently modelled as shuffle + bitcast.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/178292
Gathered loads form a DAG instead of a tree in the SLP vectorizer. When
doing the throttling analysis for such graphs, the vectorizer needs to
consider partially matched gathered loads DAG nodes, along with the
extract and/or gather operations and their costs.
The patch adds this analysis and allows cutting off the expensive
sub-graphs with gathered loads.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/177855
Recommit after revert in d733771113339608aff6002d1fa89aaf4a51c502, which
was related to a crash in SelectionDAG
The cost modeling logic was attempting to set a bit in an APInt for an
out-of-bounds index, causing an assertion failure. This patch ignores
OOB indices, as they produce poison, which is already handled.
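The bounds check can be pictured with a small stand-alone sketch. It is not the patched LLVM code: `collectDemandedElts` is a hypothetical name and `std::vector<bool>` stands in for the APInt bit set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: marks the vector elements referenced by Indices.
// Out-of-bounds indices are skipped instead of being set, mirroring the idea
// that such accesses produce poison and need no bit of their own.
std::vector<bool> collectDemandedElts(const std::vector<std::size_t> &Indices,
                                      std::size_t NumElts) {
  std::vector<bool> Demanded(NumElts, false);
  for (std::size_t Idx : Indices) {
    if (Idx >= NumElts)
      continue; // OOB index: nothing to mark
    Demanded[Idx] = true;
  }
  return Demanded;
}
```

Calling it with indices `{0, 5, 2}` and four elements marks only elements 0 and 2; index 5 is ignored rather than tripping a bounds assertion.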
Fixes #176780
This is the same test result that produces this bug:
https://github.com/user-attachments/assets/80593902-9d15-4e18-850b-a558bca8518e
The patch models the cost and lowering of a disjoint or reduction of
shl(zext, (0, stride, 2 * stride)) as a bitcast by modeling it as combined ops.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/177041
Gathered loads form a DAG instead of a tree in the SLP vectorizer. When
doing the throttling analysis for such graphs, the vectorizer needs to
consider partially matched gathered loads DAG nodes, along with the
extract and/or gather operations and their costs.
The patch adds this analysis and allows cutting off the expensive
sub-graphs with gathered loads.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/177855
If multiple nodes are generated from the same PHI node for the same block,
the vector nodes still need to be vectorized, even if the value for the incoming block was already emitted.
Fixes #177124
If the copyables have parents that are used in PHI nodes, this causes complex
schedulable/non-schedulable dependencies, which require complex
processing for little profitability. Cut such cases early for now to
prevent compiler crashes and compile-time blowup.
Fixes #176658
`OwningArrayRef` has several problems.
The naming is strange: `ArrayRef` is specifically a non-owning view, so
the name means "owning non-owning view".
It has a const-correctness bug that is inherent to the interface.
`OwningArrayRef<T>` publicly derives from `MutableArrayRef<T>`. This
means that the following code compiles:
```c++
#include "llvm/ADT/ArrayRef.h"

void const_incorrect(llvm::OwningArrayRef<int> const a) {
  a[0] = 5; // compiles even though `a` is const
}
```
It's surprising for a non-reference type to allow modification of its
elements even when it's declared `const`. However, the problems from
this inheritance (which ultimately stem from the same issue as the weird
name) are even worse. The following function compiles without warning
but corrupts memory when called:
```c++
#include "llvm/ADT/ArrayRef.h"

void memory_corruption(llvm::OwningArrayRef<int> a) {
  a.consume_front(); // advances the pointer that is later passed to delete[]
}
```
This happens because `MutableArrayRef::consume_front` modifies the
internal data pointer to advance the referenced array forward. That's
not an issue for `MutableArrayRef` because it's just a view. It is an
issue for `OwningArrayRef` because that pointer is passed as the
argument to `delete[]`, so when it's modified by advancing it forward it
ceases to be valid to `delete[]`. From there, undefined behavior occurs.
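The mechanism can be observed without invoking undefined behavior using a toy model. Everything here is hypothetical and simplified: `ToyView` and `ToyOwning` only imitate the shape of `MutableArrayRef` and `OwningArrayRef`, and, unlike the real class, `ToyOwning` remembers the allocation separately so that its destructor stays well-defined.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a mutable non-owning view: a pointer plus a length.
struct ToyView {
  int *Data = nullptr;
  std::size_t Length = 0;
  void consume_front() {
    assert(Length > 0 && "cannot consume from an empty view");
    ++Data; // harmless for a view: it only moves the window
    --Length;
  }
};

// Toy stand-in for an owning type that inherits the view interface.
struct ToyOwning : ToyView {
  int *Allocation; // the pointer actually returned by new[]
  explicit ToyOwning(std::size_t N) : Allocation(new int[N]()) {
    Data = Allocation;
    Length = N;
  }
  ToyOwning(const ToyOwning &) = delete;
  ToyOwning &operator=(const ToyOwning &) = delete;
  ~ToyOwning() { delete[] Allocation; }
  // True while Data is still the pointer that new[] returned, i.e. while
  // passing Data itself to delete[] would be valid.
  bool dataIsDeletable() const { return Data == Allocation; }
};
```

After a single inherited `consume_front()` call, `dataIsDeletable()` turns false: a destructor that ran `delete[] Data`, as described above, would be handed a pointer that `new[]` never returned.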
It is less convenient than `llvm::SmallVector` for construction. By
combining the `size` and the `capacity` together without going through
`std::allocator` to get memory, it's not possible to fill in data with
the correct value to begin with. Instead, the user must construct an
`OwningArrayRef` of the appropriate size, then fill in the data. This
has one of two consequences:
1. If `T` is a class type, we have to first default construct all of the
elements when we construct `OwningArrayRef` and then in a second pass we
can assign to those elements to give what we want. This wastes time and
for some classes is not possible.
2. If `T` is a built-in type, the data starts out uninitialized. Filling
it in is an easily forgotten step, and forgetting it means we access
uninitialized memory.
Using `llvm::SmallVector`, by contrast, has well-known constructors
that can fill in the data that we actually want on construction.
`OwningArrayRef` has slightly different performance characteristics than
`llvm::SmallVector`, but the difference is minimal.
The first difference is a theoretical negative for `OwningArrayRef`: by
implementing in terms of `new[]` and `delete[]`, the implementation has
less room to optimize these calls. However, I say this is theoretical
because for clang, at least, the extra freedom of optimization given to
`std::allocator` is not yet taken advantage of (see
https://github.com/llvm/llvm-project/issues/68365)
The second difference is slightly in favor of `OwningArrayRef`:
`sizeof(llvm::SmallVector<T>) == sizeof(void *) * 3` on pretty much any
implementation, whereas `sizeof(OwningArrayRef) == sizeof(void *) * 2`
which seems like a win. However, this is just a misdirection of the
accounting costs: array-new sticks bookkeeping information in the
allocated storage. There are some cases where this is beneficial to
reduce stack usage, but that minor benefit doesn't seem worth the costs.
If we actually need that optimization, we'd be better served by writing
a `DynamicArray` type that implements a full vector-like feature set
(except for operations that change the size of the container) while
allocating through `std::allocator` to avoid the pitfalls outlined
earlier.