Update GetIndexRangeInShuffles to skip unused shuffles. This matches the
behavior in the loop below and without it, we end up with an index
mis-match, causing a crash for the added test case.
PR: https://github.com/llvm/llvm-project/pull/179217
It appears to create loads from underaligned addresses. See comment on
the PR.
> Following on from #128938, trim the low end of loads where only some of
> the incoming lanes are used for rebroadcasts in shufflevector
> instructions.
This reverts commit 6c8d9d0c4da51c7f9e7671902be3ad9b65d56c84 and the
follow-up commits 07a6a23f6c5387fc1e7df174b5921f6004db64e0 and
313a2008538abb61ab13f8cc9f9a712f7faff3a5.
This PR fixes an [issue I pointed out in regards to incorrect GEP
indices](https://github.com/llvm/llvm-project/pull/149093#discussion_r2748266079)
introduced by PR #149093.
Changes:
- Updated the pointer offset calculation in
`VectorCombine::shrinkLoadForShuffles` so that the offset is now
multiplied by the element size (`ElemSize`) when computing the new
pointer for loads
- Updated the GEP indices in
`llvm/test/Transforms/VectorCombine/load-shufflevector.ll` for the
correct byte offsets
## Summary
Fixes assertion failure when `shrinkLoadForShuffles` processes shuffle
masks containing poison elements.
The bug was introduced in #149093 , when adjusting mask indices for load
trimming, poison indices (-1) were modified to invalid values (e.g.,
-2), causing `isSingleSourceMaskImpl` to assert.
The fix preserves poison indices without modification.
Fixes#178917
## Test plan
- Added regression test `@shuffle_with_poison_mask`
Following on from #128938, trim the low end of loads where only some of
the incoming lanes are used for rebroadcasts in shufflevector
instructions.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Addresses an issue in #173153. This patch expanded the supported ops for
folding binary ops through shuffles, but seemingly had a typo which
could inaccurately increase the unmodified cost.
Fixes#177144. Nits appreciated.
The fold in question does the following transformation:
Before
```
%bc = bitcast <4 x i32> %src to <16 x i8>
%e0 = extractelement <16 x i8> %bc, i32 0
%s0 = select i1 %cond, i8 %e0, i8 0
%e1 = extractelement <16 x i8> %bc, i32 1
%s1 = select i1 %cond, i8 %e1, i8 0
...
```
After
```
%sel = select i1 %cond, <4 x i32> %src, <4 x i32> zeroinitializer
%bc = bitcast <4 x i32> %sel to <16 x i8>
%e0 = extractelement <16 x i8> %bc, i32 0
%e1 = extractelement <16 x i8> %bc, i32 1
...
```
If every select shares the condition and has 0 in the false branch, the
bitcast can be replaced with a select between the original vector and
`zeroinitializer`, followed by a bitcast. Then each `select(cond,
extelt(...), 0)` can be replaced with `extelt(...)`.
The crash happens when the condition is defined after the original
bitcast, because the bitcast is replaced with the select + bitcast, and
now the select references a condition not yet defined.
#175194, #177159, and #173069 introduced the code calling
`TTI.getMinMaxReductionCost` with unexpected `Intrinsic::ID` causing
RISC-V to fail with `llvm_unreachable` panic.
Functionally, this is a small fix that also ports tests for the
aforementioned folds to RISCV.
Resolves#170500.
Implemented mergeInfo static helper to return common
TTI::OperandValueInfo data .
Added common OperandValueInfo `Op0Info` && `Op1Info` to NewCost
calculation.
This patch removes the single-use restriction of selects in
foldShuffleOfSelects, allowing the fold to trigger for multi-use
instructions as well if the cost model finds it cheaper.
Fixes#173036
Fixes#173037
Remove the `m_OneUse` restriction in `foldShuffleOfIntrinsics` and
update the cost model to account for additional uses of the original intrinsics.
Fixes#173033
This patch extends VectorCombine to fold binary operations through
shuffles in scenarios involving multiple uses of both the binary
operator and its operands.
Previously, the transformation was restricted to single-use cases to
prevent instruction duplication. This change implements a cost-based
evaluation that allows the fold even when:
1. The binary operator has multiple users (requiring duplication of the
arithmetic instruction).
2. The operands of the binary operator (the shuffles) have multiple
users (requiring the original shuffles to be preserved).
The optimization is performed if the TTI cost of the new instruction
sequence—including any duplicated arithmetic—is lower than the cost of
the shuffle sequence it replaces. This is particularly beneficial on X86
targets for expensive cross-lane shuffles.
If we're shuffling/concatenating the same operands then ensure we don't
duplicate the total cost, ensure we reuse the final shuffle and
recognise that we reduce the total instruction count (so fold even when
NewCost == OldCost, not just NewCost < OldCost).
Keeping the extracted element in a natural position in the narrowed
vector has two beneficial effects:
1. It makes the narrowing shuffles cheaper (at least on AMDGPU), which
allows the insert/extract fold to trigger.
2. It makes the narrowing shuffles in a chain of extract/insert
compatible, which allows foldLengthChangingShuffles to successfully
recognize a chain that can be folded.
There are minor X86 test changes that look reasonable to me. The IR
change for AVX2 in
llvm/test/Transforms/VectorCombine/X86/extract-insert-poison.ll
doesn't change the assembly generated by `llc -mtriple=x86_64--
-mattr=AVX2`
at all.
This pr resolves [#170867](https://github.com/llvm/llvm-project/issues/170867)
Existing code recomputes the cost for creating a shuffle instruction even for the
repeating Intrinsic operand pairs. This will result in higher newCost.
Hence the runtime will decide not to fold.
The change proposed in this pr will address this issue. When calculating
the newCost we are skipping the cost calculation of an operand pair if
it was already considered. And when creating the transformed code, we
are reusing the already created shuffle instruction for repeated operand
pair.
[VectorCombine] Fold permute of intrinsics into intrinsic of permutes
Add foldPermuteOfIntrinsic to transform:
shuffle(intrinsic(args), poison) -> intrinsic(shuffle(args))
when the shuffle is a permute (operates on single vector) and the cost
model determines the transformation is profitable.
This optimization is particularly beneficial for subvector extractions
where we can avoid computing unused elements.
For example:
%fma = call <8 x float> @llvm.fma.v8f32(<8 x float> %a, %b, %c)
%result = shufflevector <8 x float> %fma, poison, <4 x i32> <0,1,2,3>
transforms to:
%a_low = shufflevector <8 x float> %a, poison, <4 x i32> <0,1,2,3>
%b_low = shufflevector <8 x float> %b, poison, <4 x i32> <0,1,2,3>
%c_low = shufflevector <8 x float> %c, poison, <4 x i32> <0,1,2,3>
%result = call <4 x float> @llvm.fma.v4f32(%a_low, %b_low, %c_low)
The transformation creates one shuffle per vector argument and calls the
intrinsic with smaller vector types, reducing computation when only a
subset of elements is needed.
The existing foldShuffleOfIntrinsics handled the blend case (two
intrinsic inputs), this adds support for the permute case (single
intrinsic input).
Fixes#170002
Ensure the arguments are passed to the getShuffleCost calls to improve
cost analysis, in particular if these are constant the costs will be
recognised as free
Noticed while reviewing #170052
This change aims to convert vector loads to scalar loads, if they are
only converted to scalars after anyway.
alive2 proof: https://alive2.llvm.org/ce/z/U_rvht
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index
-> shuffle DestVec, (shuffle (fneg SrcVec), poison, SrcMask), Mask
In previous, the above transform was only possible if the Extract/Insert
Index was the same; this patch makes the above transform possible
even if the two indexes are different.
Proof: https://alive2.llvm.org/ce/z/aDfdyG
Fixes: https://github.com/llvm/llvm-project/issues/125675
This change aims to avoid inserting a freeze instruction between the
load and bitcast when scalarizing extend-extract. This is particularly
useful in combination with
https://github.com/llvm/llvm-project/pull/164682, which can then
potentially further scalarize, provided there is no freeze.
alive2 proof: https://alive2.llvm.org/ce/z/W-GD88
The scalarizeExtExtract transform assumed little-endian lane ordering,
causing miscompiles on big-endian targets such as AIX/PowerPC under -O3
-flto.
This patch updates the shift calculation to handle endianness correctly
for big-endian targets. No functional change
for little-endian targets.
Fixes#158197.
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
This patch addresses
https://github.com/llvm/llvm-project/pull/155216#discussion_r2297724663.
This patch adds a helper function to put the inverse cast on constants,
with cast flags preserved(optional).
Follow-up patches will add trunc/ext handling on VectorCombine and flags
preservation on InstCombine.
Technically this could happen with vector units that can't handle all legal scalar widths - but its good enough to use a generic crash test without a suitable target
Fixes#157335
Resolves#154797.
This patch adds the fold `bitop(bitcast(x), C) -> bitop(bitcast(x),
cast(InvC)) -> bitcast(bitop(x, InvC))`.
The helper function `getLosslessInvCast` tries to calculate the constant
`InvC`, satisfying `castop(InvC) == C`, and will try its best to keep
the poison-generated flags of the cast operation.
Consider the following pattern:
```
C = op A B
D = op C
E = op D, C
```
As `E` is dead, we call `eraseInstruction(E)` and see if its operands
become dead. `RecursivelyDeleteTriviallyDeadInstructions(D)` also erases
`C`, which causes a UAF crash in the subsequent call
`RecursivelyDeleteTriviallyDeadInstructions(C)`.
This patch also adds deleted ops into the visit list to avoid double
deletion.
Closes https://github.com/llvm/llvm-project/issues/155543.
`RecursivelyDeleteTriviallyDeadInstructions` is introduced by
https://github.com/llvm/llvm-project/pull/149047 to immediately drop
dead instructions. However, it may invalidate the next iterator in
`make_early_inc_range` in some edge cases, which leads to a crash. This
patch manually maintains the next iterator and updates it when the next
instruction is about to be deleted.
Closes https://github.com/llvm/llvm-project/issues/155110.