The vector combiner will process all instructions as it first loops
through the function, adding any newly added and deleted instructions to
a worklist which is then processed when all nodes are done. This leaves
extra uses in the graph while the initial processing is performed, leading
to sub-optimal decisions being made for other combines. This changes it
so that trivially dead instructions are removed immediately. The main
change this requires is to make sure iterator invalidation does not
occur.
Fixes #153012
As we tolerate unfoldable constant expressions in `scalarizeOpOrCmp`, we
may fold
```llvm
define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) #0 {
entry:
%158 = insertelement <2 x i64> <i64 5, i64 ptrtoint (ptr @val to i64)>, i64 %idx, i32 0
%159 = or disjoint <2 x i64> splat (i64 2), %158
store <2 x i64> %159, ptr %ptr2
ret void
}
```
to
```llvm
define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) {
entry:
%.scalar = or disjoint i64 2, %idx
%0 = or <2 x i64> splat (i64 2), <i64 5, i64 ptrtoint (ptr @val to i64)>
%1 = insertelement <2 x i64> %0, i64 %.scalar, i64 0
store <2 x i64> %1, ptr %ptr2, align 16
ret void
}
```
The result is then folded back by `foldInsExtBinop`, resulting in an
infinite loop.
This patch only performs the scalarization when InstSimplify can fold
the constant expression.
Attempt to narrow a phi of shufflevector instructions where the two
incoming values have the same operands but different masks.
Related to #128938.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Reopens #128938.
Attempt to shrink the size of vector loads where only some of the
incoming lanes are used for rebroadcasts in shufflevector instructions.
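A minimal sketch of the kind of rewrite this aims for (the function
names and types are illustrative, not taken from the actual tests): when
only lane 0 of a loaded vector is rebroadcast, the load can be shrunk to
just the lanes that are used.
```llvm
; before: a full 4-lane load whose only use is a broadcast of lane 0
define <4 x float> @broadcast_lane0(ptr %p) {
  %v = load <4 x float>, ptr %p, align 16
  %b = shufflevector <4 x float> %v, <4 x float> poison, <4 x i32> zeroinitializer
  ret <4 x float> %b
}

; after (sketch): only the used lane is loaded; the broadcast remains
define <4 x float> @broadcast_lane0_shrunk(ptr %p) {
  %v = load <1 x float>, ptr %p, align 16
  %b = shufflevector <1 x float> %v, <1 x float> poison, <4 x i32> zeroinitializer
  ret <4 x float> %b
}
```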
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Using GEP to index into a vector is not disallowed, but not recommended.
The SPIR-V backend needs to generate structured access into types, which
is impossible with an untyped GEP instruction unless we add more info to
the IR. Finding a solution is a work-in-progress, but in the meantime,
we'd like to reduce the number of failures.
Preventing this optimization from rewriting extract/insert
instructions into a GEP helps us lower more code to SPIR-V. This change
should be OK as it's only active when targeting SPIR-V and only
disables a non-recommended transformation.
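For illustration, this is roughly the kind of rewrite being disabled
when targeting SPIR-V (a hypothetical sketch, not a test from this
change): a load plus variable-index extract scalarized through a GEP
that indexes into the vector type.
```llvm
; before: extract a variable lane from a loaded vector
define i32 @extract_lane(ptr %p, i64 %idx) {
  %v = load <4 x i32>, ptr %p, align 16
  %e = extractelement <4 x i32> %v, i64 %idx
  ret i32 %e
}

; after: the GEP-based form that VectorCombine would normally produce,
; which indexes into the vector type through an untyped pointer
define i32 @extract_lane_gep(ptr %p, i64 %idx) {
  %gep = getelementptr inbounds <4 x i32>, ptr %p, i64 0, i64 %idx
  %e = load i32, ptr %gep, align 4
  ret i32 %e
}
```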
Related to #145002
This patch generalizes the existing foldBitOpOfBitcasts optimization in the VectorCombine pass to handle additional cast operations beyond just bitcast.
Fixes: [#146037](https://github.com/llvm/llvm-project/issues/146037)
Summary
The optimization now supports folding bitwise operations (AND/OR/XOR)
with the following cast operations:
- bitcast (original functionality)
- trunc (truncate)
- sext (sign extend)
- zext (zero extend)
The transformation pattern is:
bitop(castop(x), castop(y)) -> castop(bitop(x, y))
This reduces the number of cast instructions from 2 to 1, improving
performance on targets where cast operations
are expensive or where performing bitwise operations on narrower types
is beneficial.
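For example (an illustrative sketch, not one of the added tests), an
`and` of two `zext`s becomes a `zext` of a narrower `and`:
```llvm
; before: two zexts feeding a bitwise and
define <4 x i32> @and_of_zexts(<4 x i16> %x, <4 x i16> %y) {
  %xz = zext <4 x i16> %x to <4 x i32>
  %yz = zext <4 x i16> %y to <4 x i32>
  %r = and <4 x i32> %xz, %yz
  ret <4 x i32> %r
}

; after: the and is performed on the narrow type, with a single zext
define <4 x i32> @and_of_zexts_folded(<4 x i16> %x, <4 x i16> %y) {
  %a = and <4 x i16> %x, %y
  %r = zext <4 x i16> %a to <4 x i32>
  ret <4 x i32> %r
}
```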
Implementation Details
- Renamed foldBitOpOfBitcasts to foldBitOpOfCastops to reflect broader
functionality
- Extended pattern matching to handle any CastInst operation
- Added validation for each cast type's constraints (e.g., trunc
requires source > dest)
- Updated cost model to use the actual cast opcode
- Preserves IR flags from original instructions
- Handles multi-use scenarios appropriately
Testing
- Added comprehensive tests in
test/Transforms/VectorCombine/bitop-of-castops.ll
- Tests cover all supported cast types with all bitwise operations
- Includes negative tests for unsupported patterns
- All existing VectorCombine tests pass
getVectorInstrCostHelper would return costs of zero for vector
inserts/extracts that move data between GPR and vector registers, if
there was no 'real' use, i.e. there was no corresponding existing
instruction.
This meant that passes like LoopVectorize and SLPVectorizer, which
likely are the main users of the interface, would underestimate the cost
of insert/extracts that move data between GPR and vector registers,
which can have non-trivial costs.
The patch removes the special case and only returns costs of zero for
lane 0 if there is no need to transfer between integer and vector
registers.
This impacts a number of SLP tests, and most of them look like general
improvements. I think the change should make things more accurate for
any AArch64 target, but if not it could also just be Apple CPU specific.
I am seeing +2% end-to-end improvements on SLP-heavy workloads.
PR: https://github.com/llvm/llvm-project/pull/146526
Add a new scalarization transform that tries to convert extracts of a
vector ZExt to a set of scalar shift and mask operations. This can be
profitable if the cost of extracting is the same or higher than the cost
of 2 scalar ops. This is the case on AArch64 for example.
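A minimal sketch of the idea, assuming a little-endian lane layout
(names and types are illustrative):
```llvm
; before: the whole vector is zero-extended just to extract one lane
define i32 @extract_of_zext(<4 x i8> %v) {
  %z = zext <4 x i8> %v to <4 x i32>
  %e = extractelement <4 x i32> %z, i64 1
  ret i32 %e
}

; after (sketch): the narrow vector is moved to a scalar register once
; and the lane is recovered with a shift and a mask
define i32 @extract_of_zext_scalarized(<4 x i8> %v) {
  %bits = bitcast <4 x i8> %v to i32
  %shift = lshr i32 %bits, 8   ; lane 1 starts at bit 8 (little endian)
  %lane = and i32 %shift, 255  ; mask back down to the original i8 width
  ret i32 %lane
}
```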
For AArch64, this shows up in a number of workloads, including av1aom,
gmsh, minizinc and astc-encoder.
PR: https://github.com/llvm/llvm-project/pull/142976
Some intrinsics like llvm.abs or llvm.powi have a scalar argument even
when the overloaded type is a vector.
This patch handles these in scalarizeOpOrCmp to allow scalarizing them.
In the test, the leftover vector powi isn't folded away to poison; this
should be fixed in a separate patch.
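A rough sketch of what this enables for llvm.powi, whose exponent
operand stays scalar (illustrative only; as noted above, the leftover
vector powi on the constant lanes is not yet folded away):
```llvm
declare <2 x double> @llvm.powi.v2f64.i32(<2 x double>, i32)
declare double @llvm.powi.f64.i32(double, i32)

; before: a powi of an insert of %x into a constant base vector
define <2 x double> @powi_scalarize(double %x, i32 %exp) {
  %ins = insertelement <2 x double> <double 1.0, double 2.0>, double %x, i64 0
  %r = call <2 x double> @llvm.powi.v2f64.i32(<2 x double> %ins, i32 %exp)
  ret <2 x double> %r
}

; after (sketch): lane 0 is computed with a scalar powi and reinserted
; into the vector powi of the remaining constant lanes
define <2 x double> @powi_scalarized(double %x, i32 %exp) {
  %s = call double @llvm.powi.f64.i32(double %x, i32 %exp)
  %vecc = call <2 x double> @llvm.powi.v2f64.i32(<2 x double> <double 1.0, double 2.0>, i32 %exp)
  %r = insertelement <2 x double> %vecc, double %s, i64 0
  ret <2 x double> %r
}
```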
Currently, LLVM fails to convert certain pblendvb intrinsics into select
instructions when the blend mask is derived from complex boolean logic
operations. This occurs even when the mask is ultimately based on
sign-extended comparison results, preventing further optimization
opportunities.
Fixes #66513
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
The shuffle merging code assumes that the shuffle sources are all the
same type, which fails if we've changed length and don't have 2 inner
shuffles. We already handle length-changing shuffles if we do have 2
inner shuffles.
This patch creates a fake "all poison" shuffle mask and reuses the other
shuffle's sources, which can be safely used with the existing merge
code.
The alternative was a considerable refactor of the merge code to account
for different vector widths.
Fixes #144656
This adds support for unary operators, and unary + ternary intrinsics in
scalarizeOpOrCmp (FKA scalarizeBinOpOrCmp).
The motivation behind this is to scalarize more intrinsics in
VectorCombine rather than in DAGCombine, so we can sink splats across
basic blocks: see https://github.com/llvm/llvm-project/pull/137786
The main change required is to generalize the existing VecC0/VecC1 rules
across n-ary ops:
- An operand can either be a constant vector or an insert of a scalar
into a constant vector
- If it's an insert, the index needs to be static and in bounds
- If it's an insert, all indices need to be the same across all operands
- If all the operands are constant vectors, bail as it will get constant
folded anyway
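For example, under these rules a ternary intrinsic such as fma whose
operands are all lane-0 inserts into constant vectors can be scalarized
(a hypothetical sketch, not one of the added tests):
```llvm
declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)
declare float @llvm.fma.f32(float, float, float)

; before: every operand is an insert of a scalar into a constant vector
; at the same static, in-bounds index
define <4 x float> @fma_scalarize(float %a, float %b, float %c) {
  %ia = insertelement <4 x float> poison, float %a, i64 0
  %ib = insertelement <4 x float> poison, float %b, i64 0
  %ic = insertelement <4 x float> poison, float %c, i64 0
  %v = call <4 x float> @llvm.fma.v4f32(<4 x float> %ia, <4 x float> %ib, <4 x float> %ic)
  ret <4 x float> %v
}

; after: a single scalar fma inserted back at lane 0
define <4 x float> @fma_scalarized(float %a, float %b, float %c) {
  %s = call float @llvm.fma.f32(float %a, float %b, float %c)
  %v = insertelement <4 x float> poison, float %s, i64 0
  ret <4 x float> %v
}
```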
This addresses a TODO where previously scalarizeBinopOrCmp
conservatively bailed if one of the operands was a load.
getVectorInstrCost was updated to take in values in
https://reviews.llvm.org/D140498 so we can pass in the scalar value to
be inserted, which should return an accurate cost for a gather.
To prevent regressions on x86 this tries to constant fold NewVecC up
front so we can pass it into TTI and get a more accurate cost.
We want to remove this restriction on RISC-V since this is always
profitable whether or not the scalar is a load.
Currently VectorCombine can scalarize vector compares and binary ops.
This extends it to also scalarize binary-op like intrinsics like umax,
minnum etc.
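A sketch of the kind of fold this enables, in the form VectorCombine
expects (illustrative, not one of the added tests):
```llvm
declare <4 x i32> @llvm.umax.v4i32(<4 x i32>, <4 x i32>)
declare i32 @llvm.umax.i32(i32, i32)

; before: one operand is an insert into a constant vector, the other is
; a constant splat
define <4 x i32> @umax_scalarize(i32 %x) {
  %ins = insertelement <4 x i32> <i32 1, i32 2, i32 3, i32 4>, i32 %x, i64 0
  %r = call <4 x i32> @llvm.umax.v4i32(<4 x i32> %ins, <4 x i32> splat (i32 42))
  ret <4 x i32> %r
}

; after: a scalar umax for lane 0, inserted into the constant-folded
; umax of the remaining lanes
define <4 x i32> @umax_scalarized(i32 %x) {
  %s = call i32 @llvm.umax.i32(i32 %x, i32 42)
  %r = insertelement <4 x i32> <i32 42, i32 42, i32 42, i32 42>, i32 %s, i64 0
  ret <4 x i32> %r
}
```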
The motivation behind this is to scalarize more intrinsics in
VectorCombine rather than in DAGCombine, so we can sink splats across
basic blocks: see #137786
This currently has very little effect on generated code because
InstCombine doesn't yet canonicalize binary intrinsics where one operand
is a constant into the form that VectorCombine expects, i.e. `binop
(shuffle insert) const --> shuffle (binop insert const)`. The plan is to
land this first and then in a subsequent patch teach InstCombine to do
the canonicalization to avoid regressions in the meantime.
This uses `isTriviallyVectorizable` to determine whether or not an
intrinsic is safe to scalarize. There's also `isTriviallyScalarizable`,
but this seems more geared towards the Scalarizer pass and includes
intrinsics with multiple return values.
It also only handles intrinsics with two operands of the same type as
the return type. In the future we could generalize this to handle
arbitrary numbers of operands, including unary operators too, e.g. fneg
or fma, as well as different operand types, e.g. powi or scmp.
The intent of this code is to split larger vectors into smaller shuffles, but
it is currently triggering on some small vector types. Limit it to vectors
larger than 128 bits.
The shuffle needn't be twice the original number of vector elements, so
the intermediate type used between the shuffle and the intrinsic should
use the ShuffleDstTy number of elements.
I found this when looking at shuffle costs and do not have a test where it
alters the output, but have added some cases where the shuffle output is
not twice the size of the input.
This adds a test that exercises the part of scalarizeBinOpOrCmp that
produces immediate UB as described in
https://github.com/llvm/llvm-project/pull/138095#discussion_r2070133432,
but is fortunately currently folded into a correct transform.
I also noticed a bunch of immediate UB in some of the existing tests so
this also cleans them up. They should still all be scalarized though.
Since e39f6c1844fab59c638d8059a6cf139adb42279a, opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.
Reviewed By: shiltian, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/137921
Previously only fixed vector splats were handled. This adds support for
scalable vectors too by allowing ConstantExpr splats.
We need to add the extra V->getType()->isVectorTy() check because a
ConstantExpr might be a scalar to vector bitcast.
By allowing ConstantExprs this also allows fixed vector ConstantExprs to
be folded, which causes the diffs in
llvm/test/Analysis/ValueTracking/known-bits-from-operator-constexpr.ll
and llvm/test/Transforms/InstSimplify/ConstProp/cast-vector.ll. I can
remove them from this PR if reviewers would prefer.
Fixes #132922
In the previous code (#128032), the destination vector was passed as the
getShuffleCost argument. Because the shuffle mask indexes into the
concatenation of the two source vectors, the maximum mask value is twice
the size of the source vector. This causes a problem if the destination
vector is smaller than the source vector and the mask specifies an index
that exceeds the size of the destination vector.
Fix the problem by correcting the previous code, which was using the
wrong argument in the cost calculation.
Fixes #130250
(shuffle(select(c1,t1,f1)), (select(c2,t2,f2)), m)
-> (select (shuffle c1,c2,m), (shuffle t1,t2,m), (shuffle f1,f2,m))
The behaviour of SelectInst on vectors is the same as for
`V'select[i] = Condition[i] ? V'True[i] : V'False[i]`.
If a ShuffleVector is performed on two selects, it behaves like:
`V'[mask] = (V'select[i] = Condition[i] ? V'True[i] : V'False[i])`
That's why a ShuffleVector of two SelectInsts is equivalent to
first shuffling Condition/True/False and then applying a SelectInst to
that result.
This patch implements the transform described above.
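A sketch of the transform in IR (mask and types are illustrative):
```llvm
; before: a shuffle of two vector selects
define <4 x i32> @shuffle_of_selects(<4 x i1> %c1, <4 x i32> %t1, <4 x i32> %f1, <4 x i1> %c2, <4 x i32> %t2, <4 x i32> %f2) {
  %s1 = select <4 x i1> %c1, <4 x i32> %t1, <4 x i32> %f1
  %s2 = select <4 x i1> %c2, <4 x i32> %t2, <4 x i32> %f2
  %r = shufflevector <4 x i32> %s1, <4 x i32> %s2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x i32> %r
}

; after: shuffle the conditions and both arms, then select once
define <4 x i32> @select_of_shuffles(<4 x i1> %c1, <4 x i32> %t1, <4 x i32> %f1, <4 x i1> %c2, <4 x i32> %t2, <4 x i32> %f2) {
  %cs = shufflevector <4 x i1> %c1, <4 x i1> %c2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  %ts = shufflevector <4 x i32> %t1, <4 x i32> %t2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  %fs = shufflevector <4 x i32> %f1, <4 x i32> %f2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  %r = select <4 x i1> %cs, <4 x i32> %ts, <4 x i32> %fs
  ret <4 x i32> %r
}
```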
Proof: https://alive2.llvm.org/ce/z/97wfHp
Fixes #120775
Replace a binary op of two reductions with one reduction of the binary op
applied to vectors. For example:
```
%v0_red = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %v0)
%v1_red = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %v1)
%res = add i32 %v0_red, %v1_red
```
gets transformed to:
```
%1 = add <16 x i32> %v0, %v1
%res = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %1)
```
The `foldInsExtVectorToShuffle` function combines an extractelement from one vector and an insertelement into another vector into a single shuffle. However, we only supported vectors of the same size.
This commit allows combining extract/insert for vectors with the same element type but different lengths by converting the length of the vectors.
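A hypothetical sketch of the kind of pattern this now handles (the exact
shuffles emitted may differ): extracting from a <4 x i32> and inserting
into an <8 x i32> becomes a length-changing shuffle followed by a
two-source shuffle.
```llvm
; before: extract lane 2 of %src and insert it at lane 5 of %dst
define <8 x i32> @ins_ext(<4 x i32> %src, <8 x i32> %dst) {
  %e = extractelement <4 x i32> %src, i64 2
  %r = insertelement <8 x i32> %dst, i32 %e, i64 5
  ret <8 x i32> %r
}

; after (sketch): %src is length-converted with a poison-padded shuffle,
; then a single two-source shuffle places src lane 2 into dst lane 5
define <8 x i32> @ins_ext_shuffled(<4 x i32> %src, <8 x i32> %dst) {
  %src.wide = shufflevector <4 x i32> %src, <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
  %r = shufflevector <8 x i32> %dst, <8 x i32> %src.wide, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 10, i32 6, i32 7>
  ret <8 x i32> %r
}
```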
Proof: https://alive2.llvm.org/ce/z/ELNLr7
Fixed https://github.com/llvm/llvm-project/issues/120772
If we're interleaving 2 constant splats, for instance `<vscale x 8 x
i32> <splat of 666>` and `<vscale x 8 x i32> <splat of 777>`, we can
create a larger splat `<vscale x 8 x i64> <splat of ((777 << 32) |
666)>` first before casting it back into `<vscale x 16 x i32>`.
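A sketch in IR on a little-endian target, using the constants from the
example above:
```llvm
declare <vscale x 16 x i32> @llvm.vector.interleave2.nxv16i32(<vscale x 8 x i32>, <vscale x 8 x i32>)

; before: interleaving two constant splats
define <vscale x 16 x i32> @interleave_splats() {
  %v = call <vscale x 16 x i32> @llvm.vector.interleave2.nxv16i32(<vscale x 8 x i32> splat (i32 666), <vscale x 8 x i32> splat (i32 777))
  ret <vscale x 16 x i32> %v
}

; after (sketch): one wider splat of (777 << 32) | 666 = 0x3090000029A,
; bitcast back to the interleaved element type
define <vscale x 16 x i32> @interleave_splats_folded() {
  %v = bitcast <vscale x 8 x i64> splat (i64 3337189589658) to <vscale x 16 x i32>
  ret <vscale x 16 x i32> %v
}
```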
Add a foldInsExtBinop fold to clean up missed vectorization cases that can happen on targets with cheap insert/extract instructions, which prevent foldExtractExtract (binop(extract(x),extract(y)) -> extract(binop(x,shuffle(y)))) from helping with the merge.
These were based on instruction count, not throughput - we can probably improve these further, but these throughput numbers match the worst expanded shuffles we see in the vector-shuffle-128-v* codegen tests.
If we're just moving a single element around inside a 128-bit lane (probably as an alternative to extracting it), we can assume this is cheap as a single PSRLDQ/PSHUFD/SHUFPS.
I've got the horrid feeling we're moving towards matching all SSE shuffle patterns inside the cost model, but I'm going to do my best to avoid this for now :|
CmpPredicate::getMatching implicitly assumes that both predicates are
integer-predicates, and this has led to a crash being reported in
VectorCombine after e409204 (VectorCombine: teach foldExtractedCmps
about samesign). FP predicates are simple enough to handle as there is
never any samesign information associated with them: hence handle them
in CmpPredicate::getMatching, fixing the VectorCombine crash and
guarding against future incorrect usages.
Follow up on 4a0d53a (PatternMatch: migrate to CmpPredicate) to get rid
of one of the FIXMEs it introduced by replacing a predicate comparison
with CmpPredicate::getMatching.
foldPermuteOfBinops currently requires both binop operands to be oneuse shuffles to fold the shuffles across the binop, but there will be cases where it's still profitable to fold across the binop with only one foldable shuffle.
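For instance (a hypothetical sketch), folding the outer permute across an fadd where only the first operand is a oneuse shuffle can cancel the shuffles on that operand entirely:
```llvm
; before: only %x goes through a (oneuse) shuffle before the binop
define <4 x float> @permute_of_binop(<4 x float> %x, <4 x float> %y) {
  %sx = shufflevector <4 x float> %x, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %b = fadd <4 x float> %sx, %y
  %r = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x float> %r
}

; after (sketch): the outer permute is folded into both operands; the
; composed mask on %x is the identity, so only %y still needs a shuffle
define <4 x float> @permute_of_binop_folded(<4 x float> %x, <4 x float> %y) {
  %sy = shufflevector <4 x float> %y, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %r = fadd <4 x float> %x, %sy
  ret <4 x float> %r
}
```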
improveShuffleKindFromMask matches this as a SK_InsertSubvector of a v1f32 (which legalises to f32) into a v4f32 base vector, making it easy to recognise. MOVSS is limited to index 0.
Avoid always assuming the worst for v4f32 2 input shuffles, and match the SHUFPS pattern where possible - each pair of output elements must come from the same source register.