908 Commits

Author SHA1 Message Date
Simon Pilgrim
89ca3e72ca
[CostModel][X86] Reduce worst case v8i16/v16i8 SSE2 shuffle costs (#124789)
These were based off instruction count, not throughput - we can probably improve these further, but these throughput numbers match the worse expanded shuffles we see in the vector-shuffle-128-v* codegen tests.
2025-01-29 10:23:09 +00:00
Simon Pilgrim
7d172f96ff
[CostModel][X86] getShuffleCosts - convert all shuffle cost tables to be CostKind compatible. NFC. (#124753)
No change in actual costs yet, but split the costs per cost kind to make it easier to tweak the numbers in future patches.
2025-01-28 15:04:25 +00:00
Kazu Hirata
1782168c52 [X86] Fix a warning
This patch fixes:

  llvm/lib/Target/X86/X86TargetTransformInfo.cpp:1583:47: error:
  comparison of integers of different signs: 'size_t' (aka 'unsigned
  long') and 'typename iterator_traits<const int *>::difference_type'
  (aka 'long') [-Werror,-Wsign-compare]
2025-01-27 10:02:21 -08:00
Simon Pilgrim
178f47143a
[CostModel][X86] getShuffleCost - shuffles with only one defined element are always cheap (#124412)
If we're just moving a single element around inside a 128-bit lane (probably as an alternative to extracting it), we can assume this is cheap as a single PSRLDQ/PSHUFD/SHUFPS.

I've got the horrid feeling we're moving towards matching all SSE shuffle patterns inside the cost model, but I'm going to do my best to avoid this for now :|
2025-01-27 15:56:22 +00:00
Simon Pilgrim
dec47b76f4
[CostModel][X86] Update baseline CTTZ/CTLZ costs for x86_64 (#124312)
Followup to #123623 - now that the CMOV has been removed, the throughput has improved, reducing the benefit of vectorization on pre-x86-64-v3 CPUs
2025-01-26 14:43:51 +00:00
Simon Pilgrim
a12d7e4b61
[SLP] getVectorCallCosts - don't provide scalar argument data for vector IntrinsicCostAttributes (#124254)
getVectorCallCosts determines the cost of a vector intrinsic, based off
an existing scalar intrinsic call - but we were including the scalar
argument data to the IntrinsicCostAttributes, which meant that not only
was the cost calculation not type-only based, it was making incorrect
assumptions about constant values etc.

This also exposed an issue that x86 relied on fallback calculations for
funnel shift costs - this is great when we have the argument data as
that improves the accuracy of uniform shift amounts etc., but meant that
type-only costs would default to Cost=2 for all custom lowered funnel
shifts, which was far too cheap.

This is the reverse of #124129 where we weren't including argument data
when we could.

Fixes #63980
2025-01-24 15:13:13 +00:00
Simon Pilgrim
1fa56038f6 [CostModel][X86] getIntrinsicInstrCost - lrint/llrint costs can use getCastInstrCost without argument data
We don't use the IntrinsicCostAttributes arguments so, which allows us to use in type-only analysis in a future patch.
2025-01-24 09:12:07 +00:00
Alexey Bataev
bab7920fd7
[RISCV][CG]Use processShuffleMasks for per-register shuffles
Patch adds usage of processShuffleMasks in in codegen
in lowerShuffleViaVRegSplitting. This function is already used for X86
shuffles estimations and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE
functions, unifies the code.

Reviewers: topperc, wangpc-pp, lukel97, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/121765
2025-01-13 17:06:25 -05:00
Simon Pilgrim
a5e129ccde [CostModel][X86] getVectorInstrCost - correctly cost v4f32 insertelement into index 0
This is just the MOVSS instruction (SSE41 INSERTPS is still necessary for index != 0)

This exposed an issue in VectorCombine::foldInsExtFNeg - we need to use the more general SK_PermuteTwoSrc shuffle kind to allow getShuffleCost to match other shuffle kinds (not just SK_Select).
2025-01-07 12:23:45 +00:00
Simon Pilgrim
5a7dfb4659 [CostModel][X86] Attempt to match v4f32 shuffles that map to MOVSS/INSERTPS instruction
improveShuffleKindFromMask matches this as a SK_InsertSubvector of a v1f32 (which legalises to f32) into a v4f32 base vector, making it easy to recognise. MOVSS is limited to index0.
2025-01-07 11:31:44 +00:00
Simon Pilgrim
db88071a8b
[CostModel][X86] Attempt to match cheap v4f32 shuffles that map to SHUFPS instruction (#121778)
Avoid always assuming the worst for v4f32 2 input shuffles, and match the SHUFPS pattern where possible - each pair of output elements must come from the same source register.
2025-01-06 17:54:36 +00:00
Simon Pilgrim
7cdbde70fa
[CostModel][X86] getShuffleCost - use processShuffleMasks for all shuffle kinds to legal types (#120599) (#121760)
Now that processShuffleMasks can correctly handle 2 src shuffles, we can completely remove the shuffle kind limits and correctly recognize the number of active subvectors per legalized shuffle - improveShuffleKindFromMask will determine the shuffle kind for each split subvector.
2025-01-06 13:32:55 +00:00
Simon Pilgrim
611401c115 [CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599)
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
2024-12-20 10:39:45 +00:00
Simon Pilgrim
091448e3c1
Revert "[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types" (#120707)
Reverts llvm/llvm-project#120599 - some recent tests are currently failing
2024-12-20 10:06:03 +00:00
Simon Pilgrim
81e63f9e0c
[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599)
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
2024-12-20 09:55:11 +00:00
Simon Pilgrim
9bb1d0369c
[X86] getShuffleCost - when splitting shuffles, if a whole vector source is just copied we should treat this as free. (#120561)
If the shuffle split results in referencing a single legalised whole vector (i.e. no permutation), then this can be treated as free.

We already do something similar for broadcasts / whole subvector insertion + extraction - its purely an issue for register allocation.
2024-12-19 12:55:44 +00:00
Simon Pilgrim
df2356b475 [X86] getShuffleCost - ensure we treat constant folded shuffles as free 2024-12-17 09:01:54 +00:00
Simon Pilgrim
4f933277a5
[CostModel][X86] Improve cost estimation of insert_subvector shuffle patterns of legalized types (#119363)
In cases where the base/sub vector type in an insert_subvector pattern legalize to the same width through splitting, we can assume that the shuffle becomes free as the legalized vectors will not overlap.

Note this isn't true if the vectors have been widened during legalization (e.g. v2f32 insertion into v4f32 would legalize to v4f32 into v4f32).

Noticed while working on adding processShuffleMasks handling for SK_PermuteTwoSrc.
2024-12-10 16:28:56 +00:00
Simon Pilgrim
53e9eee0e2 [X86][TTI] Use TargetCostConstants Free/Basic values instead of hard coded 0/1 to make the costs calculation more obvious. NFC. 2024-12-10 10:51:06 +00:00
Simon Pilgrim
85d15bd130
[TTI][X86] getMemoryOpCost - reduced costs when loading uniform values due to value reuse (#118642)
Similar to what we do for broadcast shuffles, when legalising load costs, if the value is known to be uniform, then we will only load a single vector and reuse this across the split legalised registers.

Fixes #111126
2024-12-04 16:36:00 +00:00
Simon Pilgrim
041e5c96c4 [X86] getMemoryOpCost - ensure we pass through OpInfo / Instruction args to base getMemoryOpCost calls
Nothing really uses these yet, but we shouldn't be losing the info.

We can also pass on the OpInfo arg to the getMemoryOpCost constant load call to indicate if its constant/uniform/pow2 etc.

Prep cleanup for #111126
2024-12-04 11:43:52 +00:00
Simon Pilgrim
94df95de6b
[TTI][X86] getShuffleCosts - for SK_PermuteTwoSrc, if the masks are known to be "inlane" no need to scale the costs by worst-case legalization (#117999)
SK_PermuteTwoSrc legalization has to assume any of the legalised source registers could be referenced in split shuffles, but if we already know that each 128-bit lane only references elements from the same lane of the source operands, then this scaling won't occur.

Hopefully this can help with #113356 without us having to get full processShuffleMasks canonicalization finished first.
2024-12-01 12:01:47 +00:00
Jonas Paulsson
0ad6be1927
[SLPVectorizer, TargetTransformInfo, SystemZ] Improve SLP getGatherCost(). (#112491)
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.

getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().

SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
2024-11-29 21:19:45 +01:00
Kazu Hirata
dfe43bd1ca
[X86] Remove unused includes (NFC) (#115593)
Identified with misc-include-cleaner.
2024-11-09 08:23:46 -08:00
Simon Pilgrim
ac1869aa70
[CostModel][X86] Add initial costs for non-lane-crossing one/two input shuffles (#114680)
Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fallback to worst case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane" and enable us to assume a cheaper shuffle cost.
2024-11-04 10:19:02 +00:00
Matthias Braun
255e441613
X86: Do not return invalid cost for fp16 conversion (#114128)
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.

This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs being left to the fallback assuming
scalarization.

This is a follow-up to assertion issues reported for the changes in
#113195
2024-10-29 17:16:17 -07:00
Matthias Braun
054c23d78f
X86: Improve cost model of fp16 conversion (#113195)
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:

- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (SSE
  register can hold 4xfloat converted/stored to 4xf16) this is necessary as
  fp16 stores are neither modeled as trunc-stores nor can we mark direct Xxfp16
  stores as legal as we generally expand fp16 operations).
- Add missing cost entries to `X86TTIImpl::getCastInstrCost`
  conversion from/to fp16. Note that conversion from f64 to f16 is not
  supported by an X86 instruction.
2024-10-25 16:22:24 -07:00
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Simon Pilgrim
8b6e1dc924 [X86] getIntImmCostInst - reduce i64 imm costs of AND(X,CMASK) case that can fold to BEXT/BZHI
With BEXT/BZHI the i64 imm mask will be replaced with a i16/i8 control mask

Fixes #111323
2024-10-07 12:55:54 +01:00
Simon Pilgrim
c978d05a26 [X86] getIntImmCostInst - pull out repeated Imm.getBitWidth() calls. NFC. 2024-10-07 12:44:59 +01:00
Philip Reames
d288574363
[TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824)
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.

This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
2024-09-25 07:25:57 -07:00
Jay Foad
e03f427196
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
2024-09-19 16:16:38 +01:00
Simon Pilgrim
f673882323
[X86] Allow speculative BSR/BSF instructions on targets with CMOV (#102885)
Currently targets without LZCNT/TZCNT won't speculate with BSR/BSF instructions in case they have a zero value input, meaning we always insert a test+branch for the zero-input case.

This patch proposes we allow speculation if the target has CMOV, and perform a branchless select instead to handle the zero input case. This will predominately help x86-64 targets where we haven't set any particular cpu target. We already always perform BSR/BSF instructions if we were lowering a CTLZ/CTTZ_ZERO_UNDEF instruction.
2024-08-22 11:11:00 +01:00
Simon Pilgrim
254da5ab8b [CostModel][X86] Add missing costkinds for scalar CTLZ/CTTZ instructions
Baed off worst case llvm-mca numbers for CTLZ/CTTZ(+ZERO_UNDEF) codegen

Prep work for #102885
2024-08-20 15:26:04 +01:00
Simon Pilgrim
3ef9220811 [CostModel][X86] Add missing AVX512 vector mul overflow intrinsic costs
Fix regressions in #100519
2024-07-29 15:53:42 +01:00
Simon Pilgrim
f2a0f97f7e [CostModel][X86] Improve vector mul overflow intrinsic costs 2024-07-26 14:34:13 +01:00
Simon Pilgrim
010dcfd85f [CostModel][X86] Improve add/sub/mul overflow intrinsic costs
Noticed due to x86 changes in #97463
2024-07-25 16:01:05 +01:00
Simon Pilgrim
b2b68c241a [CostModel][X86] Add add/sat sat intrinsic costs
Fixes regressions from #97463 due to missing costs for custom lowered ops
2024-07-25 14:35:39 +01:00
Tianqing Wang
3d494bfc7f
[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. (#98495)
The `!unpredictable` metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
2024-07-23 07:47:21 +08:00
Shengchen Kan
a48305e0f9 [X86][CodeGen] Convert masked.load/store to CLOAD/CSTORE node only when vector size = 1
This fixes the crash when building llvm-test-suite with avx512f + cf.
2024-07-05 15:50:21 +08:00
Shengchen Kan
15fc801cf0
[X86][CodeGen] Support hoisting load/store with conditional faulting (#96720)
1. Add TTI interface for conditional load/store.
2. Mark 1 x i16/i32/i64 masked load/store legal so that it's not
   legalized in pass scalarize-masked-mem-intrin.
3. Visit 1 x i16/i32/i64 masked load/store to build a target-specific
   CLOAD/CSTORE node to avoid error in
   `DAGTypeLegalizer::ScalarizeVectorResult`.
4. Combine DAG to simplify the nodes for CLOAD/CSTORE.
5. Lower CLOAD/CSTORE to CFCMOV by pattern match.

This is CodeGen part of #95515
2024-06-27 17:01:55 +08:00
Simon Pilgrim
a46a2c2b7d
[X86] Lower vXi8 multiplies using PMADDUBSW on SSSE3+ targets (#95690)
Extends https://github.com/llvm/llvm-project/pull/95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - its just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention (but lower instruction count).

Fixes https://github.com/llvm/llvm-project/issues/90748
2024-06-25 12:25:56 +01:00
Nikita Popov
f2f18459d4 Revert "Intrinsic: introduce minimumnum and maximumnum (#93841)"
As far as I can tell, this pull request was not approved, and
did not go through an RFC on discourse.

This reverts commit 89881480030f48f83af668175b70a9798edca2fb.
This reverts commit 225d8fc8eb24fb797154c1ef6dcbe5ba033142da.
2024-06-21 08:34:04 +02:00
YunQiang Su
8988148003
Intrinsic: introduce minimumnum and maximumnum (#93841)
Currently, on different platform, the behaivor of llvm.minnum is
different if one operand is sNaN:

When we compare sNaN vs NUM:

ARM/AArch64/PowerPC: follow the IEEE754-2008's minNUM: return qNaN.
RISC-V/Hexagon follow the IEEE754-2019's minimumNumber: return NUM. X86:
Returns NUM but not same with IEEE754-2019's minimumNumber as
     +0.0 is not always greater than -0.0.
MIPS/LoongArch/Generic: return NUM.
LIBCALL: returns qNaN.

So, let's introduce llvm.minmumnum/llvm.maximumnum, which always follow
IEEE754-2019's minimumNumber/maximumNumber.

Half-fix: #93033
2024-06-21 11:53:08 +08:00
Simon Pilgrim
fe3f8ad8cc [X86] getIntrinsicInstrCost - begin generalizing BSWAP load/store-folding handling.
Move load/store folding 'free costs' inside the adjustTableCost helper so we can some additional intrinsics in the future.

The plan is to do something similar for other costs callbacks as well (getArithmeticInstrCost etc.).
2024-06-17 18:01:12 +01:00
Simon Pilgrim
22530e7985 [CostModel][X86] Update vXi8 mul costs for AVX512BW/AVX2/AVX1/SSE
Later levels were inheriting some of the worst case costs from SSE/AVX1 etc.

Based off llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts

Cleanup prep work for #90748
2024-06-16 07:27:35 +01:00
Simon Pilgrim
995ba4afcd [CostModel][X86] Adjust ABS scalar SizeLatency cost to 3uops
This was previously set to 4uops which was including the cost of extra register moves in the original test code.
2024-06-11 10:29:18 +01:00
Shengchen Kan
e282118f47 [X86][TTI] Update the return value of X86TTIImpl::getNumberOfRegisters for EGPR 2024-06-05 13:45:01 +08:00
Jan Patrick Lehr
3082258d3a
[CodeGen][X86] Use TargetLowering for TypeInfo of PointerTy (#93469)
This uses the TargetLowering getSimpleValueType mechanism to retrieve
the ValueType info inside the X86 cost model.

This resolves a build issue we were seeing for the miniQMC application after
https://github.com/llvm/llvm-project/pull/92671.
2024-05-29 14:42:48 +02:00
Simon Pilgrim
e8877b29a3 [X86] getGatherScatterOpCost- remove unnecessary extra brackets. NFC. 2024-05-28 10:03:04 +01:00