As part of the "RemoveDIs" work to eliminate debug intrinsics, we're
replacing methods that use Instruction*'s as positions with iterators. A
number of these (such as getFirstNonPHIOrDbg) are sufficiently
infrequently used that we can just replace the pointer-returning version
with an iterator-returning version, hopefully without much/any
disruption.
Thus this patch has getFirstNonPHIOrDbg and
getFirstNonPHIOrDbgOrLifetime return an iterator, and updates all
call-sites. There are no concerns about the iterators returned being
converted to Instruction*'s and losing the debug-info bit: because the
methods skip debug intrinsics, the iterator head bit is always false
anyway.
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to moveBefore use iterators.
This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
The function getPartialReductionCost is already quite large and
is likely to grow in size as we add support for more cases in
future. Therefore, I think it's best to move this into the cpp
file.
This should not effect the result, unless the getArithmeticInstrCost and
getVectorInstrCost routines learn to produce different costs (with CostKind =
CodeSize for example). The -1 lanes prevent 0 lanes from (incorrectly) being
marked as free.
To deduce whether the optimization is legal we need to compare the target
features between caller and callee versions. The criteria for bypassing
the resolver are the following:
* If the callee's feature set is a subset of the caller's feature set,
then the callee is a candidate for direct call.
* Among such candidates the one of highest priority is the best match
and it shall be picked, unless there is a version of the callee with
higher priority than the best match which cannot be picked from a
higher priority caller (directly or through the resolver).
* For every higher priority callee version than the best match, there
is a higher priority caller version whose feature set availability
is implied by the callee's feature set.
Example:
Callers and Callees are ordered in decreasing priority.
The arrows indicate successful call redirections.
Caller Callee Explanation
=========================================================================
mops+sve2 --+--> mops all the callee versions are subsets of the
| caller but mops has the highest priority
|
mops --+ sve2 between mops and default callees, mops wins
sve sve between sve and default callees, sve wins
but sve2 does not have a high priority caller
default -----> default sve (callee) implies sve (caller),
sve2(callee) implies sve (caller),
mops(callee) implies mops(caller)
If we sink the and in and(load), CGP can hoist is back again to the
load, getting into an infinite loop. This prevents sinking the and in
this case.
Fixes#122074
These can generally be emitted using an ext instruction or mov from the
high half. The half half extracts can be free depending on the users,
but that is not handled here, just the basic costs. It originally
included all subvector extracts, but that was toned-down to just
half-vector extracts to try and help the mid end not breakup high/low
extracts without having the SLP vectorizer create a mess using other
shuffles.
This expands the recently added fp16 fpext and fpround costs to bf16.
Some of the costs are taken from the rough number of instructions
needed, some are a little aspirational.
https://godbolt.org/z/bGEEd1vsW
Once PR #112138 lands we are able to start vectorising more loops
that have uncountable early exits. The typical loop structure
looks like this:
vector.body:
...
%pred = icmp eq <2 x ptr> %wide.load, %broadcast.splat
...
%or.reduc = tail call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> %pred)
%iv.cmp = icmp eq i64 %index.next, 4
%exit.cond = or i1 %or.reduc, %iv.cmp
br i1 %exit.cond, label %middle.split, label %vector.body
middle.split:
br i1 %or.reduc, label %found, label %notfound
found:
ret i64 1
notfound:
ret i64 0
The problem with this is that %or.reduc is kept live after the loop,
and since this is a boolean it typically requires making a copy of
the condition code register. For AArch64 this requires an additional
cset instruction, which is quite expensive for a typical find loop
that only contains 6 or 7 instructions.
This patch attempts to improve the codegen by sinking the reduction
out of the loop to the location of it's user. It's a lot cheaper to
keep the predicate alive if the type is legal and has lots of
registers for it. There is a potential downside in that a little
more work is required after the loop, but I believe this is worth
it since we are likely to spend most of our time in the loop.
== We were previously returning an invalid cost when truncating
anything to <vscale x 2 x i1>, which is incorrect since we can
generate perfectly good code for this.
== The costs for truncating legal or unpacked types to predicates
seemed overly optimistic. For example, when truncating
<vscale x 8 x i16> to <vscale x 8 x i1> we typically do
something like
and z0.h, z0.h, #0x1
cmpne p0.h, p0/z, z0.h, #0
I guess it might depend upon whether the input value is
generated in the same block or not and if we can avoid the
inreg zero-extend. However, it feels safe to take the more
conservative cost here.
== The costs for some truncates such as
trunc <vscale x 2 x i32> %a to <vscale x 2 x i16>
were 1, whereas in actual fact they are free and no instructions
are required.
== Also, for this
trunc <vscale x 8 x i32> %a to <vscale x 8 x i16>
it's just a single uzp1 instruction so I reduced the cost to 1.
In general, I've added costs for all cases where the destination
type is legal or unpacked. One unfortunate side effect of this
is the costs for some fixed-width truncates when using SVE now
look too optimistic.
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.
This is a preparatory step in migrating the codebase over to
CmpPredicate. Since we no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
The base cost approximates the expansion code in SelectionDAGBuilder.
For the AArch64 cases that don't need generic expansion, fixed-length
search vectors have a higher cost than scalable vectors due to the extra
instructions to convert the boolean mask.
This adds some basic costs for fpext and fpround, many of which were
already handled by the generic costing routines but this does make some
adjustments for larger vector types that can use fcvtn+fcvtn2, as
opposed to fcvtn+fcvtn+concat.
These should now more closely match the codegen from
https://godbolt.org/z/r3P9Mf8ez, for example.
Some of the scalable tests have been split off to make the tests more
managable. AArch64TTIImpl::getCastInstrCost is also formatted to avoid the need
to fight against CI.
Add initial heuristics to selectively enable runtime unrolling for loops
where doing so is expected to be highly beneficial on Apple Silicon
CPUs.
To start with, we try to runtime-unroll small, single block loops, if
they have load/store dependencies, to expose more parallel memory
access streams [1] and to improve instruction delivery [2].
We also explicitly avoid runtime-unrolling for loop structures that may
limit the expected gains from runtime unrolling. Such loops include
loops with complex control flow (aren't innermost loops, have multiple
exits, have a large number of blocks), trip count expansion is
expensive and are expected to execute a small number of iterations.
Note that the heuristics here may be overly conservative and we err on
the side of avoiding runtime unrolling rather than unroll excessively.
They are all subject to further refinement.
Across a large set of workloads, this increase the total number of
unrolled loops by 2.9%.
[1] 4.6.10 in Apple Silicon CPU Optimization Guide
[2] 4.4.4 in Apple Silicon CPU Optimization Guide
Depends on https://github.com/llvm/llvm-project/pull/118316 for TTI
changes.
PR: https://github.com/llvm/llvm-project/pull/118317
- Sink splat operands to mul instructions for types where we can use the
lane-indexed variants.
- When sinking operands for [su]mull, also sink the ext instruction.
Extend the support for implicit selects in the form of OR with a ZExt
operand to support ADD and SUB binops as well. They similarly can form
implicit selects which can be profitable to convert back the branches.
PR: https://github.com/llvm/llvm-project/pull/115489
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.
getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().
SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
This patch moves the `areInlineCompatible` implementation from multiple
subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to
the base class `BasicTTIImpl`. The new implementation checks whether the
callee's target features are a subset of the caller's, enabling
consistent behavior across targets. Subclasses now simply delegate to
the base implementation, reducing code duplication and improving
maintainability.
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
…er possible
In case of Neon, if there exists extractelement from lane != 0 such that
1. extractelement does not necessitate a move from vector_reg -> GPR
2. extractelement result feeds into fmul
3. Other operand of fmul is a scalar or extractelement from lane 0 or
lane equivalent to 0
then the extractelement can be merged with fmul in the backend and it
incurs no cost.
e.g.
```
define double @foo(<2 x double> %a) {
%1 = extractelement <2 x double> %a, i32 0
%2 = extractelement <2 x double> %a, i32 1
%res = fmul double %1, %2
ret double %res
}
```
`%2` and `%res` can be merged in the backend to generate:
`fmul d0, d0, v0.d[1]`
The change was tested with SPEC FP(C/C++) on Neoverse-v2.
**Compile time impact**: None
**Performance impact**: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.
This patch aims to reduce the include used by AArch64ISelLowering, allowing it
to be included by unittests so that they can reference the AArch64ISD nodes.
It:
- Moves the inclusion of AArch64SMEAttributes.h to the uses.
- Moves LowerPtrAuthGlobalAddressStatically to a static function, so that
AArch64PACKey is not required in the header.
- Moves the definitions of getExceptionPointerRegister to the cpp file, to
remove the reference of AArch64::X0.
Previously, the cost model was returning an invalid cost. This simply
moves the check from one place to another. This is mostly to make the
cost modeling code a bit easier to follow.
---------
Co-authored-by: Mel Chen <mel.chen@sifive.com>
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
Affected intrinsics:
llvm.aarch64.sve.fcvt.bf16f32
llvm.aarch64.sve.fcvtnt.bf16f32
The named intrinsics took a predicate based on the smallest element type
when it should be based on the largest. The intrinsics have been replace
by v2 equivalents and affected code ported to use them.
Patch includes changes to getSVEPredicateBitCast() that ensure the
generated code for the auto-upgraded old intrinsics is unchanged.
The "narrowing top" convert instructions leave the bottom half of active
elements untouched and thus the first paramater of their associated
intrinsic remains live even when there are no inactive lanes.
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
fadd reduction with
1. Fast flag set
2. No of elements in input vector is power of 2 results in series of
faddp instructions. faddp instruction has latency/throughput identical
to fadd instruction and hence, we set relative cost=1 for faddp as well.
The change didn't show any regression with SPEC17-FP(C/C++),
llvm-test-suite on Neoverse-V2.