getLoopEstimatedTripCount returns std::nullopt when the trip count would
overflow the return type, but since it is an estimate anyway, we might
as well saturate at UINT_MAX, improving results.
If an in-loop reduction is chained e.g.
WIDEN-REDUCTION-PHI ir<%rdx> = phi ir<0>, ir<%add2>
REDUCE ir<%add1> = ir<%rdx> + reduce.add (ir<%x>)
REDUCE ir<%add2> = ir<%add1> + reduce.add (ir<%y>)
When we try to unroll the second add reduction, we crash because we
currently expect the chain to be a VPReductionPHIRecipe, when in fact
it's the previous reduction. This relaxes the cast to a dyn_cast, so we
end up unrolling to:
WIDEN-REDUCTION-PHI ir<%rdx> = phi ir<0>, ir<%add2>
WIDEN-REDUCTION-PHI ir<%rdx>.1 = phi ir<0>, ir<%add2>.1, ir<1>
WIDEN-REDUCTION-PHI ir<%rdx>.2 = phi ir<0>, ir<%add2>.2, ir<2>
WIDEN-REDUCTION-PHI ir<%rdx>.3 = phi ir<0>, ir<%add2>.3, ir<3>
REDUCE ir<%add1> = ir<%rdx> + reduce.add (ir<%x>)
REDUCE ir<%add1>.1 = ir<%rdx>.1 + reduce.add (ir<%x>.1)
REDUCE ir<%add1>.2 = ir<%rdx>.2 + reduce.add (ir<%x>.2)
REDUCE ir<%add1>.3 = ir<%rdx>.3 + reduce.add (ir<%x>.3)
REDUCE ir<%add2> = ir<%add1> + reduce.add (ir<%y>)
REDUCE ir<%add2>.1 = ir<%add1>.1 + reduce.add (ir<%y>.1)
REDUCE ir<%add2>.2 = ir<%add1>.2 + reduce.add (ir<%y>.2)
REDUCE ir<%add2>.3 = ir<%add1>.3 + reduce.add (ir<%y>.3)
This fixes a crash when building 525.x264_r from SPEC CPU 2017 on
AArch64 with -mllvm -prefer-inloop-reductions
The old hack of returning v5/v6i32 for the fat and strided buffer
pointers was causing issuse during vectorization queries that expected
to be able to construct a VectorType from the return value of `MVT
getPointerType()`. On example is in the test attached to this PR, which
used to crash.
Now, we define the custom MVT entries, the 160-bit
amdgpuBufferFatPointer and 192-bit amdgpuBufferStridedPointer, which are
used to represent ptr addrspace(7) and ptr addrspace(9) respectively.
Neither of these types will be present at the time of lowering to a
SelectionDAG or other MIR - MVT::amdgpuBufferFatPointer is eliminated by
the LowerBufferFatPointers pass and amdgpu::bufferStridedPointer is not
currently used outside of the SPIR-V translator (which does its own
lowering).
An alternative solution would be to add MVT::i160 and MVT::i192. We
elect not to do this now as it would require changes to unrelated code
and runs the risk of breaking any SelectionDAG code that assumes that
the MVT series are all powers of two (and so can be split apart and
merged back together) in ways that wouldn't be obvious if someone tried
to use MVT::i160 in codegen. If i160 is added at some future point,
these custom types can be retired.
This PR enable scalable loop vectorization for fmax and fmin reductions
with f16/bf16 type when only zvfhmin/zvfbfmin are enabled.
After https://github.com/llvm/llvm-project/pull/128800, we can promote
the fmax/fmin reductions with f16/bf16 type to f32 reductions for
zvfhmin/zvfbfmin.
getLoopEstimatedTripCount returns the trip count based on profiling
data, and its documentation says that it could return 0 when the trip
count is zero, but this is not the case: a valid trip count can never be
zero, and it returns 0 when the unsigned ExitCount is incremented by 1
and wraps. Some callers are careful about checking for this zero value
in an std::optional, but it makes for an API with footguns, as a
std::optional return value indicates that a non-nullopt value would be a
valid trip count. Fix this by explicitly returning std::nullopt when the
return value would wrap, and strip additional checks in callers. This
also fixes a minor bug in LoopVectorize.
This patch converts the llvm.vector.splice intrinsic to
llvm.experimental.vp.splice, ensuring that fixed-order recurrences
execute correctly when tail folding by EVL is enable.
Due to the non-VFxUF penultimate EVL issue, the EVL from the previous
iteration will be preserved and used in llvm.experimental.vp.splice.
Follow on to #128035. It is a small extension to support vectorizing
`llvm.modf.*` and `llvm.sincospi.*` too.
This renames the test files from `sincos.ll` ->
`multiple-result-intrinsics.ll` to group together the similar tests
(which make up most of this PR).
Functions marked with minsize should aim for minimum code size, so the
vectorizer should use CodeSize for the cost kind and also the cost we
compare should be the cost for the entire loop: it shouldn't be divided
by the number of vector elements and block costs shouldn't be divided by
the block probability.
Possibly we should also be doing this for optsize as well, but there are
a lot of tests that assume the current behaviour and the definition of
optsize is less clear than minsize (for minsize the goal is to "keep the
code size of this function as small as possible" whereas for optsize
it's "keep the code size of this function low").
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materlizes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformsState::get().
PR: https://github.com/llvm/llvm-project/pull/124644
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote
This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
Currently we only check if the pointers involved in runtime checks do
not wrap if we need to perform dependency checks. If that's not the
case, we generate runtime checks, even if the pointers may wrap (see
test/Analysis/LoopAccessAnalysis/runtime-checks-may-wrap.ll).
If the pointer wraps, then we swap start and end of the runtime check,
leading to incorrect checks.
An Alive2 proof of what the runtime checks are checking conceptually (on
i4 to have it complete in reasonable time) showing the incorrect result
should be https://alive2.llvm.org/ce/z/KsHzn8
Depends on https://github.com/llvm/llvm-project/pull/127410 to avoid
more regressions.
PR: https://github.com/llvm/llvm-project/pull/127543
The assertion was left over from a time when VPBBs still had an
associated condition bit. This is not the case any more (comment was
stale). In case a branch on condition is needed, a BranchOnCond
VPInstruction is added when constructing recipes. That's also where it
is checked if the condition is available.
Exposed by 38376dee9.
The mapping of IR ExitBB to a VPBB isn't used. It also sets an incorrect
VPBB for the ExitBB; the regions successor is the middle block, no the
exit block.
It also unnecessarily triggers an assertion after 38376dee922.
Use HCFGBuilder to build an initial VPlan 0, which wraps all input
instructions in VPInstructions and update tryToBuildVPlanWithVPRecipes
to replace the VPInstructions with widened recipes.
At the moment, widened recipes are created based on the underlying
instruction of the VPInstruction. Masks are also still created based on
the input IR basic blocks and the loop CFG is flattened in the main loop
processing the VPInstructions.
This patch also incldues support for Switch instructions in HCFGBuilder
using just a VPInstruction with Instruction::Switch opcode.
There are multiple follow-ups planned:
* Perform predication on the VPlan directly,
* Unify code constructing VPlan 0 to be shared by both inner and outer
loop code paths.
* Construct VPlan 0 once, clone subsequent ones for VFs
PR: https://github.com/llvm/llvm-project/pull/124432
Update VPWidenPHIRecipe to use the predecessors in VPlan to determine
the incoming blocks instead of tracking them separately. This brings
VPWidenPHIRecipe in line with the other phi recipes.
PR: https://github.com/llvm/llvm-project/pull/126388
This patch adds initial support for vectorizing literal struct return
values. Currently, this is limited to the case where the struct is
homogeneous (all elements have the same type) and not packed. The users
of the call also must all be `extractvalue` instructions.
The intended use case for this is vectorizing intrinsics such as:
```
declare { float, float } @llvm.sincos.f32(float %x)
```
Mapping them to structure-returning library calls such as:
```
declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)
```
Or their widened form (such as `@llvm.sincos.v4f32` in this case).
Implementing this required two main changes:
1. Supporting widening `extractvalue`
2. Adding support for vectorized struct types in LV
* This is mostly limited to parts of the cost model and scalarization
Since the supported use case is narrow, the required changes are
relatively small.
Create a IR BB directly for the middle.block, instead of creating the IR
BB during skeleton creation and then replacing the middle VPBB with a
VPIRBB.
This moves another part of skeleton creation to VPlan and simplififes
the code slightly by removing code to disconnect the middle block and
vector preheader + the corresponding DT update.
NFC modulo IR block naming and block creation order, which changes the
IR names for the blocks.
Test that we do not vectorize the loop using folding by EVL, when a
fixed-order recurrence has external users.
TODO: Support external users by extractelement the EVL-th lane.
Update LangRef and code using `Dereferenceable` in assume bundles to
only use the information if it is safe at the point of use.
`Dereferenceable` in an assume bundle is only guaranteed at the point of
the assumption, but may not be guaranteed at later points, because the
pointer may have been freed.
Update code using `Dereferenceable` to only use it if the pointer cannot
be freed. This can further be refined to check if the pointer could be
freed between assume and use.
This follows up on https://github.com/llvm/llvm-project/pull/123196.
With that change, it should be safe to expose dereferenceable
assumptions more widely as in
https://github.com/llvm/llvm-project/pull/121789
PR: https://github.com/llvm/llvm-project/pull/126117
NEON does not have a version of udot/sdot that accumulates into
64-bit integer values, so we should return Invalid from
getPartialReductionCost for 64-bit types and fixed-width VFs.
In theory, if the 64-bit versions of SVE udot/sdot are available
we could use those, but we don't currently have lowering support
for that.
Update getOrCreateVPValueForSCEVExpr to only skip expansion of
SCEVUnknown if the underlying value isn't an instruction. Instructions
may be defined in a loop and using them without expansion may break
LCSSA form. SCEVExpander will take care of preserving LCSSA if needed.
We could also try to pass LoopInfo, but there are some users of the
function where it won't be available and main benefit from skipping
expansion is slightly more concise VPlans.
Note that SCEVExpander is now used to expand SCEVUnknown with floats.
Adjust the check in expandCodeFor to only check the types and casts if
the type of the value is different to the requested type. Otherwise we
crash when trying to expand a float and requesting a float type.
Fixes https://github.com/llvm/llvm-project/issues/121518.
PR: https://github.com/llvm/llvm-project/pull/125235
`forgetLcssaPhiWithNewPredecessor` performs additional invalidation if
there is an existing SCEV for the phi, but earlier
`forgetBlockAndLoopDispositions` or `forgetLoop` may already invalidate
the SCEV for the phi.
Change the order to first call `forgetLcssaPhiWithNewPredecessor` to
ensure it runs before its SCEV gets invalidated too eagerly.
Fixes https://github.com/llvm/llvm-project/issues/119665.
PR: https://github.com/llvm/llvm-project/pull/119897
* Adds variants of dotp (dotp_i8_to_i64_has_neon_dotprod,
dotp_i16_to_i64_has_neon_dotprod) that show how the loop
vectoriser has generated fixed-width partial reductions
without any matching NEON udot instruction.
* Adds loops that could also benefit from partial
reductions once the work is done to recognise patterns
such as
%zext = zext i8 %load to i32
%acc.next = add i32 %acc, %zext
See zext_add_reduc_i8_i32, etc. I intend to follow up with
a patch to add support for vectorising such patterns.
This patch relands the changes from "[LV]: Teach LV to recursively
(de)interleave.#122989"
Reason for revert:
- The patch exposed an assert in the vectorizer related to VF difference
between
legacy cost model and VPlan-based cost model because of uncalculated
cost for
VPInstruction which is created by VPlanTransforms as a replacement to
'or disjoint'
instruction.
VPlanTransforms do that instructions change when there are memory
interleaving and
predicated blocks, but that change didn't cause problems because at most
cases the cost
difference between legacy/new models is not noticeable.
- Issue is fixed by #125434
Original patch: https://github.com/llvm/llvm-project/pull/89018
Reviewed-by: paulwalker-arm, Mel-Chen
Follow-up as discussed when using VPInstruction::ResumePhi for all resume
values (#112147). This patch explicitly adds incoming values for each
predecessor in VPlan. This simplifies codegen and allows transformations
adjusting the predecessors of blocks with
NFC modulo incoming block order in phis.
The legacy and vplan cost models did not agree because
VPWidenCallRecipe::computeCost only calculates the cost of the
call instruction, whereas
LoopVectorizationCostModel::setVectorizedCallDecision in some
cases adds on the cost of a synthesised mask argument. However,
this mask is always 'splat(i1 true)' which should be hoisted out
of the loop during codegen. In order to synchronise the two cost
models I have two options:
1) Also add the cost of the splat to the vplan model, or
2) Remove the cost of the splat from the legacy model.
I chose 2) because I feel this more closely represents what the
final code will look like. There is an argument that we should
take account of such broadcast costs in the preheader when
deciding if it's profitable to vectorise a loop, however there
isn't currently a mechanism to do this. We currently only take
account of the runtime checks when assessing profitability and
what the minimum trip count should be. However, I don't believe
this work needs doing as part of this PR.
The current legality check for folding with EVL has incomplete
verification for VF.
This patch fixes the VF check, ensuring that tail folding with EVL is
enabled only when a scalable VF is available. This allows loops that
prefer tail folding with EVL but cannot use scalable VF vectorization to
still be vectorized using a fixed VF, rather than abandoning
vectorization entirely.
This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.
This patch updates the cost model for fmuladd on vector types to scale with LMUL. This was found when analyzing a hot loop in 519.lbm_r that was unprofitably vectorized, but doesn't directly impact that case and is split off so it doesn't get forgotten.
Unlike other FP arithmetic ops, it's not scaled by 2 because the scalar cost isn't scaled by 2.
The plans with scalar VF should not be transformed the plans folded by
EVL.
TODO: Move the scalar VF checking into `LoopVectorizationCostModel
::foldTailWithEVL()`.