Create truncate if needed after 56b05a0d6. Note that this preserves the
original behavior pre 56b05a0d6. If truncate would strip any set bits,
then the explicit computation in the narrower type would wrap.
This updates VPWidenPointerInductionRecipe::execute to not use the
canonical IV to determine the insert point. Instead, it relies on the
current recipe position. In cases where this is not sufficient, set the
insert point to the first non-phi instruction, to ensure phis are
created together.
Currently fast() won't return true if all flags are set via setXXX,
which is surprising. Update setters to set all bits if needed to make
sure isFast() consistently returns the expected result.
PR: https://github.com/llvm/llvm-project/pull/131321
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt-in to non-intrinsic
debug-info if it was built into LLVM.
Nowadays, non-intrinsic format is the default and has been on for more
than a year, there's no need for this flag to exist.
(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
Refactor the code to extract the first active element of a
vector in the early exit block, in preparation for PR #130766.
I've replaced the VPInstruction::ExtractFirstActive nodes with
a combination of a new VPInstruction::FirstActiveLane node and
a Instruction::ExtractElement node.
After #124093 we now support fixed-order recurrences with EVL tail
folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP
splice intrinsic.
However the costing for the splice is currently done in
VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic
we end up costing it twice.
This fixes it by splitting out the cost for the splice into
FirstOrderRecurrenceSplice so that it's not duplicated when we replace
it.
We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe
since the splice might end up dead and discarded, e.g. in the test
@pr97452_scalable_vf1_for.
When visiting in-loop reduction links, we previously crashed if we had
an fmuladd with a blend after it in the chain. This fixes it by lifting
the existing blend folding to also handle fmuladd.
This also simplifies the code structure slightly for an upcoming patch I
want to post to handle in-loop AnyOf reductions.
I removed the PhiR->isInLoop() check since it's already guarded at the
top of the parent Header->Phis() loop.
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.
Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.
PR: https://github.com/llvm/llvm-project/pull/129767
This patch restricts broadcast operations from being hoisted to the vector
preheader unless the basic block that defines the broadcasted value properly
dominates the vector preheader.
This prevents potential use-before-definition issues when the broadcasted
value is defined within the plan. VPDominatorTree is used to confirm this
restriction while still allowing safe hoisting for broadcasted values defined
outside the plan.
Issue https://github.com/llvm/llvm-project/issues/117139
Update and generalize materializeBroadcasts to also introduce explicit
broadcasts for VPValues defined in the Plans Entry block.
This fixes a crash when trying to insert the broadcasts generated by
VPTransformState::get after the generating instruction, which isn't
possible after invoke instructions.
Fixes https://github.com/llvm/llvm-project/issues/128838.
Slightly generalize materializeLiveInBroadcasts to also introduce
broadcasts for live-ins used in the vector preheader. This should cover
all live-ins.
If the live-in is used in the vector preheader, insert the broadcast at
the beginning of the block.
Following on from #125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.
It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.
We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
This is the first of two PRs that attempts to improve the IR
generated in the exit blocks of vectorised loops with uncountable
early exits. In this PR I am improving the generated code for
users of induction variables in early exit loops that have a
unique exit block, when exiting via the latch.
I have moved some of the code for calculating the exit values in
latch exit blocks from `optimizeInductionExitUsers` into a new
function `optimizeLatchExitInductionUser`.
I intend to follow this up very soon with another patch to
optimise the code for induction users in the vector.early.exit
block.
Assert that we only generate runtime checks for inner loops in
emitMemRuntimeChecks, instead of returning nullptr in the VPlan-native
path, which is causing crashes and incorrect code.
This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236
The typical codegen for the different scenarios has been mentioned as notes/comments in the code itself( this is done owing to lot of scenarios such that it would be difficult to mention them here in the patch description).
Create an empty VPlan first, then let the HCFG builder create a plain
CFG for the top-level loop (w/o a top-level region). The top-level
region is introduced by a separate VPlan-transform. This is instead of
creating the vector loop region before building the VPlan CFG for the
input loop.
This simplifies the HCFG builder (which should probably be renamed) and
moves along the roadmap ('buildLoop') outlined in [1].
As follow-up, I plan to also preserve the exit branches in the initial
VPlan out of the CFG builder, including connections to the exit blocks.
The conversion from plain CFG with potentially multiple exits to a
single entry/exit region will be done as VPlan transform in a follow-up.
This is needed to enable VPlan-based predication. Currently early exit
support relies on building the block-in masks on the original CFG,
because exiting branches and conditions aren't preserved in the VPlan.
So in order to switch to VPlan-based predication, we will have to
preserve them in the initial plain CFG, so the exit conditions are
available explicitly when we convert to single entry/exit regions.
Another follow-up is updating the outer loop handling to also introduce
VPRegionBlocks for nested loops as transform. Currently the existing
logic in the builder will take care of creating VPRegionBlocks for
nested loops, but not the top-level loop.
[1]
https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf
PR: https://github.com/llvm/llvm-project/pull/128419
emitSCEVChecks checks if SCEVCheckCond matches zero, and returns
nullptr. However, it sets SCEVCheckCond as used before it does this,
which prevents it from being removed during cleanup, resulting in
unreachable blocks being emitted. Fix this.
getLoopEstimatedTripCount returns std::nullopt when the trip count would
overflow the return type, but since it is an estimate anyway, we might
as well saturate at UINT_MAX, improving results.
If an in-loop reduction is chained e.g.
WIDEN-REDUCTION-PHI ir<%rdx> = phi ir<0>, ir<%add2>
REDUCE ir<%add1> = ir<%rdx> + reduce.add (ir<%x>)
REDUCE ir<%add2> = ir<%add1> + reduce.add (ir<%y>)
When we try to unroll the second add reduction, we crash because we
currently expect the chain to be a VPReductionPHIRecipe, when in fact
it's the previous reduction. This relaxes the cast to a dyn_cast, so we
end up unrolling to:
WIDEN-REDUCTION-PHI ir<%rdx> = phi ir<0>, ir<%add2>
WIDEN-REDUCTION-PHI ir<%rdx>.1 = phi ir<0>, ir<%add2>.1, ir<1>
WIDEN-REDUCTION-PHI ir<%rdx>.2 = phi ir<0>, ir<%add2>.2, ir<2>
WIDEN-REDUCTION-PHI ir<%rdx>.3 = phi ir<0>, ir<%add2>.3, ir<3>
REDUCE ir<%add1> = ir<%rdx> + reduce.add (ir<%x>)
REDUCE ir<%add1>.1 = ir<%rdx>.1 + reduce.add (ir<%x>.1)
REDUCE ir<%add1>.2 = ir<%rdx>.2 + reduce.add (ir<%x>.2)
REDUCE ir<%add1>.3 = ir<%rdx>.3 + reduce.add (ir<%x>.3)
REDUCE ir<%add2> = ir<%add1> + reduce.add (ir<%y>)
REDUCE ir<%add2>.1 = ir<%add1>.1 + reduce.add (ir<%y>.1)
REDUCE ir<%add2>.2 = ir<%add1>.2 + reduce.add (ir<%y>.2)
REDUCE ir<%add2>.3 = ir<%add1>.3 + reduce.add (ir<%y>.3)
This fixes a crash when building 525.x264_r from SPEC CPU 2017 on
AArch64 with -mllvm -prefer-inloop-reductions
The old hack of returning v5/v6i32 for the fat and strided buffer
pointers was causing issuse during vectorization queries that expected
to be able to construct a VectorType from the return value of `MVT
getPointerType()`. On example is in the test attached to this PR, which
used to crash.
Now, we define the custom MVT entries, the 160-bit
amdgpuBufferFatPointer and 192-bit amdgpuBufferStridedPointer, which are
used to represent ptr addrspace(7) and ptr addrspace(9) respectively.
Neither of these types will be present at the time of lowering to a
SelectionDAG or other MIR - MVT::amdgpuBufferFatPointer is eliminated by
the LowerBufferFatPointers pass and amdgpu::bufferStridedPointer is not
currently used outside of the SPIR-V translator (which does its own
lowering).
An alternative solution would be to add MVT::i160 and MVT::i192. We
elect not to do this now as it would require changes to unrelated code
and runs the risk of breaking any SelectionDAG code that assumes that
the MVT series are all powers of two (and so can be split apart and
merged back together) in ways that wouldn't be obvious if someone tried
to use MVT::i160 in codegen. If i160 is added at some future point,
these custom types can be retired.
This PR enable scalable loop vectorization for fmax and fmin reductions
with f16/bf16 type when only zvfhmin/zvfbfmin are enabled.
After https://github.com/llvm/llvm-project/pull/128800, we can promote
the fmax/fmin reductions with f16/bf16 type to f32 reductions for
zvfhmin/zvfbfmin.
getLoopEstimatedTripCount returns the trip count based on profiling
data, and its documentation says that it could return 0 when the trip
count is zero, but this is not the case: a valid trip count can never be
zero, and it returns 0 when the unsigned ExitCount is incremented by 1
and wraps. Some callers are careful about checking for this zero value
in an std::optional, but it makes for an API with footguns, as a
std::optional return value indicates that a non-nullopt value would be a
valid trip count. Fix this by explicitly returning std::nullopt when the
return value would wrap, and strip additional checks in callers. This
also fixes a minor bug in LoopVectorize.
This patch converts the llvm.vector.splice intrinsic to
llvm.experimental.vp.splice, ensuring that fixed-order recurrences
execute correctly when tail folding by EVL is enable.
Due to the non-VFxUF penultimate EVL issue, the EVL from the previous
iteration will be preserved and used in llvm.experimental.vp.splice.
Follow on to #128035. It is a small extension to support vectorizing
`llvm.modf.*` and `llvm.sincospi.*` too.
This renames the test files from `sincos.ll` ->
`multiple-result-intrinsics.ll` to group together the similar tests
(which make up most of this PR).
Functions marked with minsize should aim for minimum code size, so the
vectorizer should use CodeSize for the cost kind and also the cost we
compare should be the cost for the entire loop: it shouldn't be divided
by the number of vector elements and block costs shouldn't be divided by
the block probability.
Possibly we should also be doing this for optsize as well, but there are
a lot of tests that assume the current behaviour and the definition of
optsize is less clear than minsize (for minsize the goal is to "keep the
code size of this function as small as possible" whereas for optsize
it's "keep the code size of this function low").
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materlizes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformsState::get().
PR: https://github.com/llvm/llvm-project/pull/124644
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote
This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
Currently we only check if the pointers involved in runtime checks do
not wrap if we need to perform dependency checks. If that's not the
case, we generate runtime checks, even if the pointers may wrap (see
test/Analysis/LoopAccessAnalysis/runtime-checks-may-wrap.ll).
If the pointer wraps, then we swap start and end of the runtime check,
leading to incorrect checks.
An Alive2 proof of what the runtime checks are checking conceptually (on
i4 to have it complete in reasonable time) showing the incorrect result
should be https://alive2.llvm.org/ce/z/KsHzn8
Depends on https://github.com/llvm/llvm-project/pull/127410 to avoid
more regressions.
PR: https://github.com/llvm/llvm-project/pull/127543
The assertion was left over from a time when VPBBs still had an
associated condition bit. This is not the case any more (comment was
stale). In case a branch on condition is needed, a BranchOnCond
VPInstruction is added when constructing recipes. That's also where it
is checked if the condition is available.
Exposed by 38376dee9.
The mapping of IR ExitBB to a VPBB isn't used. It also sets an incorrect
VPBB for the ExitBB; the regions successor is the middle block, no the
exit block.
It also unnecessarily triggers an assertion after 38376dee922.