Using the latest version of the script from D103695 to compare costmodel vs llvm-mca statistics.
Avoids using the default costs, which was assuming libm calls.
Without fastmath (nnan) flags, minnum/maxnum must perform isnan handling as well as fmin/fmax - meaning the costs are notably higher, this is correctly handled in getIntrinsicInstrCost but was missing from the getMinMaxCost cost tables (which assumed fastmath).
Followup to 63c3895327839ba5b57f5b99ec9e888abf976ac6 which handled the integer cases
Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).
This will be necessary to correct recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).
Differential Revision: https://reviews.llvm.org/D148149
getMinMaxCost has an alternative set of min/max costs to getIntrinsicInstrCost that are only used by getMinMaxReductionCost, but are a lot less thorough and fallback to an expansion in most cases resulting in cost overestimations - we're better off just using getIntrinsicInstrCost.
getIntrinsicInstrCost is still missing complete FMINNUM/FMAXNUM costs, so until then getMinMaxCost will still be used for these, after that we can remove getMinMaxCost and have getMinMaxReductionCost call getIntrinsicInstrCost directly.
Fixes regression noticed in D148036
In Function getVectorInstrCost, situation Opcode == Instruction::ExtractElement
and Opcode == Instruction::InsertElement are all handled in the first 2 if-statements,
So we have no chance for the code in line 4401.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D145908
When all the pointers are off the same base address and have known
distances to each other these differences can be encoded into displacements
in x86 arch. So the only cost that matters is cost of the base GEP.
Differential Revision: https://reviews.llvm.org/D146102
In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor.
This is based off of the approach used here: 8d36708507
The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch.
See https://reviews.llvm.org/D143723#4132349
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D144474
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
And so does getShuffleCost() to getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().
To address this, this patch adds an argument CostKind to these
functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
value() has undesired exception checking semantics and calls
__throw_bad_optional_access in libc++. Moreover, the API is unavailable without
_LIBCPP_NO_EXCEPTIONS on older Mach-O platforms (see
_LIBCPP_AVAILABILITY_BAD_OPTIONAL_ACCESS).
This fixes clang.
As the added codegen test coverage shows,
there isn't that much difference between AVX512DQI and
baseline AVX512F codegen, DQI added `vpmovm2d`/`vpmovd2m`,
but with just the Foundation we can use `vpternlogd`/`vptestmd`
to do the same.
Avoid some getSExtValue()/getZExtValue() calls
Hopefully we can remove some of the getBitWidth() constraints as well, as many are just there as a proxy for legal types (albeit assuming x86_64).
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
All of our insert/extract ops work on 128-bit lanes.
For `Insert`, we need to extract affected 128-bit lane,
unless it's being fully overwritten (FIXME: do we need to be
careful about legalization-induced padding that we obviously don't demand?),
perform insertions, and then insert the 128-bit lane back.
But hold on. If we are operating on an 256-bit legal vector,
and thus have two 128-bit subvectors, and are fully overwriting them both,
we don't actually need to insert *both* subvectors,
only the second one, into the implicitly-widened first one.
Also, `Insert` wasn't actually querying the costs,
but just assuming them to be `1`.
`getShuffleCost(TTI::SK_ExtractSubvector)` notes:
```
// Note that in general, the insertion starting at the beginning of a vector
// isn't free, because we need to preserve the rest of the wide vector.
```
... so as far as i can tell, we didn't account for that.
I was hoping this would allow vectorization at a higher VF at one case i looked at,
but the subvector insertion cost is still dis-advising that.
The change for `Extract` is NFC, and is for consistency only,
i wanted to get rid of of that weird explicit discounting of insertion of 0'th element,
since the general code should already deal with that.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D137913
The mul by constant costmodels handle power-of-2 constants, but not negated-power-of-2, despite the backends handling both.
This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum and wires it for use for basic mul cost analysis and SLP handling.
Fixes#50778
Differential Revision: https://reviews.llvm.org/D111968
This mainly just adds costs for the targets where we have actual funnelshift/rotate instructions (VBMI2/XOP etc.) - the cases where we expand still need addressing, although for many the default shift+or expansion, especially for uniform cases, isn't that bad.
This was achieved with the 'cost-tables vs llvm-mca' script D103695
Add costs for the funnel shift instructions - fixes some discrepancies I was hitting with costs numbers from the 'cost-tables vs llvm-mca' script D103695
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (and recent fixes to the bdver2 + alderlake models)
Adding full CostKinds costs are affecting some other tests as they make assumptions about SizeLatency costs, so they need addressing first
These were based off a mixture of vector integer add/sub costs and the numbers from the 'cost-tables vs llvm-mca' script from D103695 - the extra costs for different predicates are still proving tricky to implement, but I've gotten most costs to within +/1 now - the AVX512 are tricky as we still don't handle predicate results properly, so most of these were done by hand.
These are the worst case generic vector shift costs, where nothing is known about the shift amounts - in particular this should stop us using the default sizelatency cost of 1 for so many pre-AVX2 vector shifts that can often actually expand during lowering to +20 uops, just for 128-bit vectors, resulting in some horrible inline/unroll decisions.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Vector shift by const uniform is the cheapest shift instruction we have, non-const uniform have a marginally higher cost - some targets 'splat' the amount internally to use the shift-per-element instruction, others see a higher cost for the explicit zeroing of the upper bits for the (64-bit) shift amount.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Corrects the shift by constant costs to better account for them being converted to multiples for lowering - which demonstrates that we should probably be trying harder NOT to convert these to multiplies for some CPUs (v4i32 in particular).
Also remove new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
They shouldn't be happening after XOP shift costs - AVX2 shift supports takes preference over XOP for everything but vXi8 shifts - the improvement is pretty limited as it only affects bdver4 targets but it does help clean up a fraction of the messy shift cost logic....
For the few non type based intrinsic cases we can just check for !isTypeBasedOnly() to access the args directly.
I don't think we have a need to keep getTypeBasedIntrinsicInstrCost in BasicTTIImpl.h any more and can do a similar merge there as well - but it's a messier refactor and will take a while.
Begin the refactoring to use CostKindTblEntry and return real latency/codesize/sizelatency costs instead of reusing the throughput numbers
This should allow us to merge getTypeBasedIntrinsicInstrCost into getIntrinsicInstrCost and remove all remaining references