The check for ABI differences for inlined calls involves the caller, the
callee and the nested callee. Before inlining, the ABI is determined by
the target features of the callee. After inlining it is determined by
the caller. The features of the nested callee should never actually
matter.
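A conceptual sketch of that rule (the helper `abiChangesWhenInlined` is hypothetical and not the in-tree code, though `areTypesABICompatible` is an existing TTI hook):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Hypothetical helper illustrating the rule above: for a call that Callee
// makes to some nested callee, the ABI before inlining is implied by Callee's
// target features and after inlining by Caller's, so the compatibility check
// only needs to compare Caller and Callee -- the nested callee's own features
// never enter into it.
bool abiChangesWhenInlined(const TargetTransformInfo &TTI,
                           const Function &Caller, const Function &Callee,
                           ArrayRef<Type *> NestedCallArgTypes) {
  return !TTI.areTypesABICompatible(&Caller, &Callee, NestedCallArgTypes);
}
```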
This patch adds a cost kind to `getAddressComputationCost()` for #149955.
Note that this patch also removes all of the default argument values in `getAddressComputationCost()`.
In some places we were passing the type of value being accessed, in
other cases we were passing the type of the pointer for the access.
The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.
This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.
No target actually inspects the contents of the type passed, only whether
it is a vector or not, so this shouldn't have a functional effect.
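A minimal sketch of what an updated call site might look like, assuming the new cost-kind parameter is simply appended to the existing arguments (the exact parameter order is an assumption):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch only: pass the pointer type of the access and an explicit cost kind;
// there are no default arguments left to fall back on.
InstructionCost addressComputationCost(const TargetTransformInfo &TTI,
                                       Type *PtrTy, ScalarEvolution *SE,
                                       const SCEV *PtrSCEV) {
  return TTI.getAddressComputationCost(
      PtrTy, SE, PtrSCEV, TargetTransformInfo::TCK_RecipThroughput);
}
```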
When these were converted to CostKindTblEntry the throughput was mainly
copied to all cost kinds
Regenerated with my check_cost_tables.py helper script
The memory value and mask value types might legalise differently - e.g. a v64i32 might split into 4 x v16i32 / 8 x v8i32 but the mask might legalize as 1 x v64i8 / 2 x v32i8 etc.
If the legalised value type has been split, then we must ensure we compute the cost for the entire mask value type and let getShuffleCost handle any legalisation, not assume that only a single trailing split mask will require widening.
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen while the default lowering promotes integer
vectors.
This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try to include a DstTy that is more
accurate. Otherwise the patch tries to be fairly non-functional, keeping
SrcTy as the primary type used in the shuffle cost routines and only using
DstTy where it was used in the past (for InsertSubVector, for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
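A sketch of a typical updated call site; the exact parameter order is an assumption, the point being that DstTy now travels alongside SrcTy:

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch only: DstTy is <MaskElts x SrcEltTy> and no longer has to be inferred
// from SrcTy and the mask (which may not even be provided).
InstructionCost twoSrcShuffleCost(const TargetTransformInfo &TTI,
                                  VectorType *DstTy, VectorType *SrcTy,
                                  ArrayRef<int> Mask) {
  return TTI.getShuffleCost(TargetTransformInfo::SK_PermuteTwoSrc, DstTy,
                            SrcTy, Mask,
                            TargetTransformInfo::TCK_RecipThroughput);
}
```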
Increase specificity by using the correct unit sizes. KBytes is an
abbreviation for kB (1000 bytes), and the hardware industry, as well as
several operating systems, has now switched to using 1000-byte kBs.
If this change is acceptable, note that GitHub sometimes mangles merges to
use the account's original email; $dayjob asks that contributions carry my
work email. Thanks!
If the lower 32 bits of an i64 value are known to be zero, then icmp
lowering will shift+truncate down to an i32, allowing the immediate to be
embedded.
There's a lot more that could be done here to match icmp lowering, but
this PR just focuses on known regressions.
Fixes #142513
Fixes #62145
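A small standalone illustration (plain C++, not the lowering code itself) of why the narrowing is sound when both the value and the constant have zero low 32 bits:

```cpp
#include <cassert>
#include <cstdint>

// If the low 32 bits of X are known to be zero and the constant C also has
// zero low 32 bits, then the 64-bit compare is equivalent to a 32-bit compare
// of the high halves, so the (now 32-bit) immediate can be embedded directly.
bool cmp64(uint64_t X, uint64_t C) { return X == C; }
bool cmp32(uint64_t X, uint64_t C) {
  return static_cast<uint32_t>(X >> 32) == static_cast<uint32_t>(C >> 32);
}

int main() {
  const uint64_t C = 0x0000'1234'0000'0000ULL; // low 32 bits are zero
  for (uint64_t Hi : {0u, 0x1234u, 0xffffu}) {
    uint64_t X = Hi << 32; // low 32 bits known zero
    assert(cmp64(X, C) == cmp32(X, C));
  }
  return 0;
}
```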
`ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars"`
changed SLPVectorizer getScalarizationOverhead() to call
TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some
cases. This was due to X86-specific handling in these (overridden) methods,
and unfortunately the general preference for TTI.getScalarizationOverhead()
was dropped. If VL is available it should always be preferred to use
getScalarizationOverhead(), and this is indeed the case for SystemZ, which
has a special insertion instruction that can insert two GPR64s.
Then `33af951 "[SLP]Synchronize cost of gather/buildvector nodes with
codegen"` reworked SLPVectorizer getGatherCost(), which together with
ad9909d caused the SystemZ test vec-elt-insertion.ll to fail.
This patch restores the SystemZ test and reverts the change in SLPVectorizer
getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always
called again. The ForPoisonSrc argument is now passed on to the TTI method
so that X86 can handle this as required.
Fixes: #135346
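A sketch of the restored call shape; the parameter order and names of the TTI hook here are assumptions:

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch only: always route through the TTI hook so targets that benefit from
// seeing the scalar Values (VL) -- e.g. SystemZ's two-GPR64 insertion -- get
// them, while ForPoisonSrc is forwarded for X86's special handling.
InstructionCost buildVectorInsertCost(const TargetTransformInfo &TTI,
                                      VectorType *VecTy,
                                      const APInt &DemandedElts,
                                      bool ForPoisonSrc,
                                      ArrayRef<Value *> VL) {
  return TTI.getScalarizationOverhead(VecTy, DemandedElts, /*Insert=*/true,
                                      /*Extract=*/false,
                                      TargetTransformInfo::TCK_RecipThroughput,
                                      ForPoisonSrc, VL);
}
```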
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional, dereferencing the value directly (either
isValid has been checked previously or the Cost is assumed to be valid).
The one case that did, in AMDGPU, used value_or, which has been replaced
by an isValid() check.
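A minimal sketch of the resulting usage pattern (hypothetical helper):

```cpp
#include "llvm/Support/InstructionCost.h"
using namespace llvm;

// InstructionCost already carries an Invalid state, so callers check isValid()
// and then read the value directly; getValue() no longer wraps it in a
// std::optional.
InstructionCost::CostType costOrDefault(InstructionCost Cost,
                                        InstructionCost::CostType Default) {
  if (!Cost.isValid())
    return Default; // roughly what value_or() provided before
  return Cost.getValue();
}
```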
These are not diagnosed because implementations hide the methods of the base class rather than overriding them.
This works as long as a hiding function is callable with the same arguments as the same function from the base class.
Pull Request: https://github.com/llvm/llvm-project/pull/136655
Making `TargetTransformInfo::Model::Impl` `const` makes sure all
interface methods are `const`, in `BasicTTIImpl`, its bases, and in all
derived classes.
Pull Request: https://github.com/llvm/llvm-project/pull/136598
The main change is making the `thisT` method `const`; the rest of the
changes fix compilation errors (*).
(*) There are two tricky methods, `getVectorInstrCost()` and
`getIntImmCost()`.
They have several overloads; some of these overloads are typically
pulled into derived classes with `using` declarations and then
hidden by methods in the derived class.
The compiler does not complain if the hiding methods are not marked as
`const`, which means that clients will use the methods from the base
class. If after this change your target fails cost model tests, this
must be the reason. To resolve the issue you need to make all hiding
overloads `const`. See the second commit in this PR.
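A self-contained C++ illustration of the pitfall (all names made up):

```cpp
// The derived, non-const overload compiles without any diagnostic, but on a
// const object only the const base overload is viable.
struct Base {
  int getIntImmCost(int Imm) const { return Imm; }
  int getIntImmCost(int Imm, int Extra) const { return Imm + Extra; }
};

struct Derived : Base {
  using Base::getIntImmCost;   // pull the base overloads into Derived
  int getIntImmCost(int Imm) { // NOTE: not const -- hides on non-const objects
    return Imm * 2;
  }
};

int callThroughConst(const Derived &D) {
  // The derived (non-const) version is quietly bypassed: returns 1, not 2.
  return D.getIntImmCost(1);
}

int main() {
  Derived D;
  return callThroughConst(D);
}
```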
Pull Request: https://github.com/llvm/llvm-project/pull/136575
I noticed these destructors taking time with -ftime-trace and moved some
of them for minor build efficiency improvements.
The main impact of moving destructors out of line is that it avoids
requiring the types held in container fields to be complete,
i.e. one can have uptr<T> or vector<T> as a field with an incomplete
type T, and that means we can reduce transitive includes, as with
LegalizerInfo.h.
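A minimal sketch of the pattern, with hypothetical names:

```cpp
#include <memory>

// "Header" part: Widget is deliberately incomplete here.
class Widget;

class Owner {
  std::unique_ptr<Widget> W; // OK with an incomplete Widget at this point
public:
  Owner();
  ~Owner(); // declared only; an inline/implicit destructor would need Widget
};

// "Source" part: Widget is complete from here on, so the out-of-line special
// members can instantiate ~unique_ptr<Widget> safely.
class Widget {
public:
  int Value = 0;
};

Owner::Owner() : W(std::make_unique<Widget>()) {}
Owner::~Owner() = default;

int main() { Owner O; }
```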
Move expensive getDebugOperandsForReg template out-of-line. The
std::function instantiation shows up in time trace even if you don't use
the function.
In order to facilitate targets that only support masked loads/stores
on certain address spaces (AMDGPU will support them in an upcoming
patch, but only for address space 7), add an AddressSpace parameter
to isLegalMaskedLoad and isLegalMaskedStore.
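A sketch of a call site under the extended hooks; the exact position of the address-space parameter is an assumption:

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch only: legality of a masked load can now depend on the address space
// of the pointer operand (e.g. only AS 7 on AMDGPU), not just type/alignment.
bool canVectorizeMaskedLoad(const TargetTransformInfo &TTI, Type *DataTy,
                            Align Alignment, unsigned AddrSpace) {
  return TTI.isLegalMaskedLoad(DataTy, Alignment, AddrSpace);
}
```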
Historically this function returned an int, with negative values meaning
invalid. It was migrated to InstructionCost in 43ace8b5ce07a, but the
code was not updated to return an invalid cost instead of -1. In that
commit, the caller in LSR was updated to assert that the cost is valid
rather than positive. We should return invalid instead of a negative
value so LSR will assert if the cost isn't valid.
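A minimal sketch of the intended pattern (hypothetical helper, not the actual LSR/TTI code):

```cpp
#include "llvm/Support/InstructionCost.h"
using namespace llvm;

// Sketch only: report "don't know" as an invalid InstructionCost rather than
// a negative sentinel, so callers asserting isValid() catch it.
InstructionCost scalingFactorCost(bool Supported, int64_t Cost) {
  if (!Supported)
    return InstructionCost::getInvalid(); // previously: return -1;
  return Cost;
}
```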
These were based off instruction count, not throughput - we can probably improve these further, but these throughput numbers match the worst expanded shuffles we see in the vector-shuffle-128-v* codegen tests.
This patch fixes:
llvm/lib/Target/X86/X86TargetTransformInfo.cpp:1583:47: error:
comparison of integers of different signs: 'size_t' (aka 'unsigned
long') and 'typename iterator_traits<const int *>::difference_type'
(aka 'long') [-Werror,-Wsign-compare]
If we're just moving a single element around inside a 128-bit lane (probably as an alternative to extracting it), we can assume this is cheap as a single PSRLDQ/PSHUFD/SHUFPS.
I've got the horrid feeling we're moving towards matching all SSE shuffle patterns inside the cost model, but I'm going to do my best to avoid this for now :|
getVectorCallCosts determines the cost of a vector intrinsic, based off
an existing scalar intrinsic call - but we were including the scalar
argument data in the IntrinsicCostAttributes, which meant that not only
was the cost calculation not type-only based, it was making incorrect
assumptions about constant values etc.
This also exposed an issue that x86 relied on fallback calculations for
funnel shift costs - this is great when we have the argument data as
that improves the accuracy of uniform shift amounts etc., but meant that
type-only costs would default to Cost=2 for all custom lowered funnel
shifts, which was far too cheap.
This is the reverse of #124129 where we weren't including argument data
when we could.
Fixes #63980
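A sketch of the type-only cost query described above; the helper itself is hypothetical, but the type-only `IntrinsicCostAttributes` constructor and `getIntrinsicInstrCost` are existing TTI APIs:

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch only: build the attributes from types alone (no scalar argument
// Values), so the returned cost cannot depend on accidental constant operands.
InstructionCost typeOnlyIntrinsicCost(const TargetTransformInfo &TTI,
                                      Intrinsic::ID ID, Type *RetTy,
                                      ArrayRef<Type *> ArgTys) {
  IntrinsicCostAttributes Attrs(ID, RetTy, ArgTys);
  return TTI.getIntrinsicInstrCost(Attrs,
                                   TargetTransformInfo::TCK_RecipThroughput);
}
```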
Patch adds usage of processShuffleMasks in codegen
in lowerShuffleViaVRegSplitting. This function is already used for X86
shuffle cost estimation and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE;
this unifies the code.
Reviewers: topperc, wangpc-pp, lukel97, preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/121765
This is just the MOVSS instruction (SSE41 INSERTPS is still necessary for index != 0)
This exposed an issue in VectorCombine::foldInsExtFNeg - we need to use the more general SK_PermuteTwoSrc shuffle kind to allow getShuffleCost to match other shuffle kinds (not just SK_Select).
improveShuffleKindFromMask matches this as a SK_InsertSubvector of a v1f32 (which legalises to f32) into a v4f32 base vector, making it easy to recognise. MOVSS is limited to index 0.
Avoid always assuming the worst for v4f32 2 input shuffles, and match the SHUFPS pattern where possible - each pair of output elements must come from the same source register.
Now that processShuffleMasks can correctly handle 2 src shuffles, we can completely remove the shuffle kind limits and correctly recognize the number of active subvectors per legalized shuffle - improveShuffleKindFromMask will determine the shuffle kind for each split subvector.
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
If the shuffle split results in referencing a single legalised whole vector (i.e. no permutation), then this can be treated as free.
We already do something similar for broadcasts / whole subvector insertion + extraction - it's purely an issue for register allocation.
In cases where the base/sub vector types in an insert_subvector pattern legalize to the same width through splitting, we can assume that the shuffle becomes free, as the legalized vectors will not overlap.
Note this isn't true if the vectors have been widened during legalization (e.g. v2f32 insertion into v4f32 would legalize to v4f32 into v4f32).
Noticed while working on adding processShuffleMasks handling for SK_PermuteTwoSrc.
Similar to what we do for broadcast shuffles, when legalising load costs, if the value is known to be uniform, then we will only load a single vector and reuse this across the split legalised registers.
Fixes #111126
Nothing really uses these yet, but we shouldn't be losing the info.
We can also pass on the OpInfo arg to the getMemoryOpCost constant load call to indicate if it's constant/uniform/pow2 etc.
Prep cleanup for #111126
SK_PermuteTwoSrc legalization has to assume any of the legalised source registers could be referenced in split shuffles, but if we already know that each 128-bit lane only references elements from the same lane of the source operands, then this scaling won't occur.
Hopefully this can help with #113356 without us having to get full processShuffleMasks canonicalization finished first.
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.
getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().
SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
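A simplified sketch of the idea (not the actual SystemZ implementation):

```cpp
#include "llvm/IR/Instructions.h"
#include "llvm/Support/InstructionCost.h"
using namespace llvm;

// Simplified sketch: if the scalar being inserted comes straight from a load,
// treat the insertion as free, since the element can be loaded directly into
// the vector register (the real hook keeps some exceptions).
InstructionCost elementInsertCost(const Value *Scalar) {
  if (Scalar && isa<LoadInst>(Scalar))
    return 0;
  return 1;
}
```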
Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fall back to worst-case shuffles that reference elements from any lane.
This patch detects shuffle masks that we know are "inlane" and enables us to assume a cheaper shuffle cost.
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.
This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs are left to the fallback assuming
scalarization.
This is a follow-up to assertion issues reported for the changes in
#113195
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:
- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (an SSE
register can hold 4 x float converted/stored to 4 x f16); this is necessary as
fp16 stores are neither modeled as trunc-stores nor can we mark direct Xxfp16
stores as legal, as we generally expand fp16 operations (see the sketch after
this list).
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
conversions from/to fp16. Note that conversion from f64 to f16 is not
supported by an X86 instruction.
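A rough sketch of the first bullet's behaviour as a hypothetical free function (not the actual override):

```cpp
#include "llvm/IR/Type.h"
#include <algorithm>
using namespace llvm;

// Sketch only: report a minimum VF of 4 when storing f16 values produced from
// f32, mirroring the getStoreMinimumVF behaviour described above.
unsigned storeMinimumVFSketch(unsigned VF, Type *ScalarMemTy,
                              Type *ScalarValTy) {
  if (ScalarMemTy->isHalfTy() && ScalarValTy->isFloatTy())
    return std::max(VF, 4u);
  return VF;
}
```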
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.