This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
This reverts https://github.com/llvm/llvm-project/pull/132976.
The PR incorrectly claimed that this makes inlining more liberal,
referencing the string comparison in TargetTransformInfoImpl.h.
However, the implementation that actually applies is the one in
BasicTTIImpl.h, which performs a feature subset comparison. As such,
this regressed inlining, most concerningly of functions without +vector
into functions with +vector.
Revert the change to restore the previous behavior.
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations notably in the SLP
vectorizer but also in the generic cost routines where assumption about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen but the default lowering to promote for integer
vectors.
This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy used as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
`ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars" `
changed SLPVectorizer getScalarizationOverhead() to call
TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some
cases. This was due to X86 specific handlings in these (overridden) methods,
and unfortunately the general preference of TTI.getScalarizationOverhead()
was dropped. If VL is available it should always be preferred to use
getScalarizationOverhead(), and this is indeed the case for SystemZ which
has a special insertion instruction that can insert two GPR64s.
Then ` 33af951 "[SLP]Synchronize cost of gather/buildvector nodes with
codegen"` reworked SLPVectorizer getGatherCost() which together with
ad9909d caused the SystemZ test vec-elt-insertion.ll to fail.
This patch restores the SystemZ test and reverts the change in SLPVectorizer
getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always
called again. The ForPoisonSrc argument is now passed on to the TTI method
so that X86 can handle this as required.
Fixes: #135346
Instead of always iterating over all GlobalVariable:s in the Module to
find the case where both Caller and Callee is using the same GV heavily,
first scan Callee (only if less than 200 instructions) for all GVs used
more than 10 times, and then do the counting for the Caller for just
those relevant GVs.
The limit of 200 instructions makes sense as this aims to inline a
relatively small function using a GV +10 times.
This resolves the compile time problem with zig where it is on main
(compared to removing the heuristic) a 380% increase, but with this
change <0.5% increase (total user compile time with opt).
Fixes#134714.
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional, dereferencing the value directly (either
isValid has been checked previously or the Cost is assumed to be valid).
The one case that does in AMDGPU used value_or which has been replaced
by a isValid() check.
These are not diagnosed because implementations hide the methods of the base class rather than overriding them.
This works as long as a hiding function is callable with the same arguments as the same function from the base class.
Pull Request: https://github.com/llvm/llvm-project/pull/136655
Making `TargetTransformInfo::Model::Impl` `const` makes sure all
interface methods are `const`, in `BasicTTIImpl`, its bases, and in all
derived classes.
Pull Request: https://github.com/llvm/llvm-project/pull/136598
The main change is making `thisT` method `const`, the rest of the
changes is fixing compilation errors (*).
(*) There are two tricky methods, `getVectorInstrCost()` and
`getIntImmCost()`.
They have several overloads; some of these overloads are typically
pulled in to derived classes using the `using` directive, and then
hidden by methods in the derived class.
The compiler does not complain if the hiding methods are not marked as
`const`, which means that clients will use the methods from the base
class. If after this change your target fails cost model tests, this
must be the reason. To resolve the issue you need to make all hiding
overloads `const`. See the second commit in this PR.
Pull Request: https://github.com/llvm/llvm-project/pull/136575
## What?
Implement `areInlineCompatible` for the SystemZ target using
FeatureBitset comparison.
## Why?
The default implementation in `TargetTransformInfoImpl.h` makes a string
comparison and only inlines when the target-cpu and the target-features
for caller and callee are the same. We are missing out on optimizations
when the callee has a subset of features of the caller.
## How?
Get the FeatureBitset of the caller and callee and check when callee is
a subset or equal to the caller's features. It's a similar
implementation to ARM, PowerPC...
## Testing?
Test cases check for when the callee is a subset of the caller, when
it's not a subset and when both are equals.
CSmith found a case where SROA produces bitcasts from scalar to vector.
This was previously asserted against in SystemZTTI, but now the BaseT
implementation takes care of it.
This patch adds support for the next-generation arch15
CPU architecture to the SystemZ backend.
This includes:
- Basic support for the new processor and its features.
- Detection of arch15 as host processor.
- Assembler/disassembler support for new instructions.
- Exploitation of new instructions for code generation.
- New vector (signed|unsigned|bool) __int128 data types.
- New LLVM intrinsics for certain new instructions.
- Support for low-level builtins mapped to new LLVM intrinsics.
- New high-level intrinsics in vecintrin.h.
- Indicate support by defining __VEC__ == 10305.
Note: No currently available Z system supports the arch15
architecture. Once new systems become available, the
official system name will be added as supported -march name.
This PR adds more realistic cost estimates for these reduction
intrinsics
- `llvm.vector.reduce.umax`
- `llvm.vector.reduce.umin`
- `llvm.vector.reduce.smax`
- `llvm.vector.reduce.smin`
- `llvm.vector.reduce.fadd`
- `llvm.vector.reduce.fmul`
- `llvm.vector.reduce.fmax`
- `llvm.vector.reduce.fmin`
- `llvm.vector.reduce.fmaximum`
- `llvm.vector.reduce.fminimum`
- `llvm.vector.reduce.mul
`
The pre-existing cost estimates for `llvm.vector.reduce.add` are moved
to `getArithmeticReductionCosts` to reduce complexity in
`getVectorIntrinsicInstrCost` and enable other passes, like the SLP
vectorizer, to benefit from these updated calculations.
These are not expected to provide noticable performance improvements and
are rather provided for the sake of completeness and correctness. This
PR is in draft mode pending benchmark confirmation of this.
This also provides and/or updates cost tests for all of these
intrinsics.
This PR was co-authored by me and @JonPsson1 .
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.
getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().
SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
Due to recently reported problems with having the inlining threshold multiplier
set fairly high (x3), this patch removes the multiplier while addressing
the regressions seen by doing so in adjustInliningThreshold().
The specific cases that benefit from inlining that were now found to be in need
of handling contain a considerable number of memory accesses to the same
memory in both caller and callee.
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
In the existing version, SystemZTTIImpl::shouldExpandReduction will
create a `cast` error when handling vector reduction intrinsics that
do not have the vector to reduce as their first operand, such as
`llvm.vector.reduce.fadd` and `llvm.vector.reduce.fmul`.
This commit fixes that problem by moving the cast into the case
statement that handles the specific intrinsic, where the vector
operand position is well-known.
This commit skips the expansion of the `vector.reduce.add` intrinsic on
vector-enabled SystemZ targets in order to introduce custom handling of
`vector.reduce.add` for legal vector types using the VSUM instructions.
This is limited to full vectors with scalar types up to `i32` due to
performance concerns.
It also adds testing for the generation of such custom handling, and
adapts the related cost computation, as well as the testing for that.
The expected result is a performance boost in certain benchmarks that
make heavy use of `vector.reduce.add` with other benchmarks remaining
constant.
For instance, the assembly for `vector.reduce.add<4 x i32>` changes from
```hlasm
vmrlg %v0, %v24, %v24
vaf %v0, %v24, %v0
vrepf %v1, %v0, 1
vaf %v0, %v0, %v1
vlgvf %r2, %v0, 0
```
to
```hlasm
vgbm %v0, 0
vsumqf %v0, %v24, %v0
vlgvf %r2, %v0, 3
```
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
Currently patchpoints can only have two result types, `void` and `i64`.
This limits the result to general purpose registers.
This patch makes `patchpoint.i64` an overloadable intrinsic, allowing
result values that can fit in a single register (e.g. integers,
pointers, floats).
This commit provides better cost estimates for
the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all
vector lengths and integer types up to i128. For integer types larger
than i128, we fall back to the default cost estimate.
This has the effect of lowering the estimated costs of most common
instances of the intrinsic. The expected performance impact of this is
minimal with a tendency to slightly improve performance of some
benchmarks.
This commit also provides a test to check the proper computation of the
new estimates, as well as the fallback for types larger than i128.
This assert does not seem justified given that the LoopVectorizer can
form interleave groups containing i128 elements where the number of
elements per vector is indeed just one.
It seems TypeSize is currently broken in the sense that:
TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)
without failing its assert that explicitly tests for this case:
assert(LHS.Scalable == RHS.Scalable && ...);
The reason this fails is that `Scalable` is a static method of class
TypeSize,
and LHS and RHS are both objects of class TypeSize. So this is
evaluating
if the pointer to the function Scalable == the pointer to the function
Scalable,
which is always true because LHS and RHS have the same class.
This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`,
so that it no longer clashes with the variable in
FixedOrScalableQuantity.
The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
In SystemZTTIImpl::getMemoryOpCost, the call to getNumberOfParts will
run type legalization, which can't handle structs. So before that, we
check for an unknown value type and forward to BaseT, just like many
other targets do in this situation.
https://bugzilla.redhat.com/show_bug.cgi?id=2224885
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D156379
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
And so does getShuffleCost() to getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().
To address this, this patch adds an argument CostKind to these
functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
This has the effect of exposing the power-of-two property for use in memory op costing, but no target actually uses it yet. The main point of this change is simple consistency with the recently changes getArithmeticInstrCost, and to remove the last (interface) use of OperandValueKind.
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.
This is the change which motivated the whole sequence which preceeded it. In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact. This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.
I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance. For instance, every parameter which changes type in this change also changes name. This was intentional to make sure that every call site possible effected must show up in the diff. This let me audit each one closely.
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
This patch boosts the inlining threshold for a particular type of functions
that are using an incoming argument only as a memcpy source.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D121341
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
I'm not sure this is the best way to approach this,
but the situation is rather not very detectable unless we explicitly call it out when refusing to advise to unroll.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D107271
As pointed out in D101726, this function already exists in MathExtras.
It uses different types, but with the values used here I believe that
should not make a functional difference.