885 Commits

Kazu Hirata
dfe43bd1ca
[X86] Remove unused includes (NFC) (#115593)
Identified with misc-include-cleaner.
2024-11-09 08:23:46 -08:00
Simon Pilgrim
ac1869aa70
[CostModel][X86] Add initial costs for non-lane-crossing one/two input shuffles (#114680)
Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fall back to worst-case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane", enabling us to assume a cheaper shuffle cost.
2024-11-04 10:19:02 +00:00
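
A minimal sketch of the kind of "inlane" mask check described in the commit above (illustrative only; the helper name and types are hypothetical, not the actual LLVM code):

    #include <vector>

    // Hypothetical helper: returns true if every element of a shuffle mask stays
    // inside its own 128-bit lane, for a vector whose elements are EltBits wide.
    // Such shuffles can be costed more cheaply than lane-crossing ones.
    static bool isInLaneShuffleMask(const std::vector<int> &Mask, unsigned EltBits) {
      unsigned EltsPerLane = 128 / EltBits;
      unsigned NumElts = (unsigned)Mask.size();
      for (unsigned I = 0; I != NumElts; ++I) {
        int M = Mask[I];
        if (M < 0)
          continue; // undef element, fits in any lane
        // For two-input shuffles, indices >= NumElts refer to the second source.
        unsigned SrcElt = (unsigned)M % NumElts;
        if (SrcElt / EltsPerLane != I / EltsPerLane)
          return false; // element crosses a 128-bit lane boundary
      }
      return true;
    }
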
Matthias Braun
255e441613
X86: Do not return invalid cost for fp16 conversion (#114128)
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.

This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs are left to the fallback, which assumes
scalarization.

This is a follow-up to assertion issues reported for the changes in
#113195
2024-10-29 17:16:17 -07:00
Matthias Braun
054c23d78f
X86: Improve cost model of fp16 conversion (#113195)
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:

- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (an SSE
  register can hold 4xfloat converted/stored to 4xf16). This is necessary as
  fp16 stores are neither modeled as trunc-stores nor can we mark direct Xxfp16
  stores as legal, since we generally expand fp16 operations.
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
  conversion from/to fp16. Note that conversion from f64 to f16 is not
  supported by an X86 instruction.
2024-10-25 16:22:24 -07:00
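
An illustrative source-level pattern (hypothetical example using the Clang/GCC _Float16 type, not taken from the patch) that the cost changes above are meant to make profitable for the SLPVectorizer - four float-to-half conversions stored contiguously, matching the minimum VF of 4 (one SSE register of 4 x float converted and stored as 4 x f16):

    // Scalar form; vectorising this into a single <4 x half> convert + store
    // should now be considered profitable by the cost model.
    void store4_f16(_Float16 *Dst, const float *Src) {
      for (int I = 0; I < 4; ++I)
        Dst[I] = (_Float16)Src[I];
    }
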
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Simon Pilgrim
8b6e1dc924 [X86] getIntImmCostInst - reduce i64 imm costs of AND(X,CMASK) case that can fold to BEXT/BZHI
With BEXT/BZHI the i64 imm mask will be replaced with an i16/i8 control mask.

Fixes #111323
2024-10-07 12:55:54 +01:00
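
An illustrative check (hypothetical helper, not the LLVM code) for the kind of immediate this covers - an AND whose i64 mask is a contiguous run of low bits can be lowered to BZHI with a small control operand, so the 64-bit immediate no longer needs to be materialised and can be costed as cheap:

    #include <cstdint>

    // True when Imm has the form 2^N - 1 (a contiguous run of low set bits),
    // e.g. 0x00000000FFFFFFFF, which BZHI can recreate from an i8 bit count.
    static bool isLowBitMask(uint64_t Imm) {
      return Imm != 0 && ((Imm & (Imm + 1)) == 0);
    }
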
Simon Pilgrim
c978d05a26 [X86] getIntImmCostInst - pull out repeated Imm.getBitWidth() calls. NFC. 2024-10-07 12:44:59 +01:00
Philip Reames
d288574363
[TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824)
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.

This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
2024-09-25 07:25:57 -07:00
Jay Foad
e03f427196
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
2024-09-19 16:16:38 +01:00
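
A small before/after illustration (the call site is hypothetical; ArrayRef and its std::nullopt_t constructor are the real LLVM APIs being discussed):

    #include "llvm/ADT/ArrayRef.h"

    void takesArgs(llvm::ArrayRef<int> Args);

    void caller() {
      // Before: takesArgs(std::nullopt);  // relied on ArrayRef(std::nullopt_t)
      takesArgs({});                       // After: simpler, and keeps working if
                                           // the std::nullopt_t constructor is removed
    }
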
Simon Pilgrim
f673882323
[X86] Allow speculative BSR/BSF instructions on targets with CMOV (#102885)
Currently targets without LZCNT/TZCNT won't speculate with BSR/BSF instructions in case they have a zero value input, meaning we always insert a test+branch for the zero-input case.

This patch proposes we allow speculation if the target has CMOV, and perform a branchless select instead to handle the zero-input case. This will predominantly help x86-64 targets where we haven't set any particular cpu target. We already always perform BSR/BSF instructions if we were lowering a CTLZ/CTTZ_ZERO_UNDEF instruction.
2024-08-22 11:11:00 +01:00
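
A conceptual C++ sketch (not LLVM's lowering code; the helper names are made up) of the difference on a target without LZCNT:

    #include <cstdint>

    // Stand-in for the BSR instruction: index of the highest set bit.
    // Like BSR, it must not be given a zero input.
    static inline uint32_t bsr32(uint32_t X) {
      return 31u - (uint32_t)__builtin_clz(X);
    }

    // Without speculation: a test + branch guards the BSR.
    uint32_t ctlz32_branchy(uint32_t X) {
      if (X == 0)
        return 32;
      return 31u - bsr32(X);
    }

    // With CMOV: run BSR speculatively and pick the zero-input result with a
    // branchless select (compiled to CMOV).
    uint32_t ctlz32_branchless(uint32_t X) {
      uint32_t R = 31u - bsr32(X | 1u); // X|1 keeps BSR's input non-zero
      return X == 0 ? 32u : R;
    }
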
Simon Pilgrim
254da5ab8b [CostModel][X86] Add missing costkinds for scalar CTLZ/CTTZ instructions
Based off worst-case llvm-mca numbers for CTLZ/CTTZ(+ZERO_UNDEF) codegen.

Prep work for #102885
2024-08-20 15:26:04 +01:00
Simon Pilgrim
3ef9220811 [CostModel][X86] Add missing AVX512 vector mul overflow intrinsic costs
Fix regressions in #100519
2024-07-29 15:53:42 +01:00
Simon Pilgrim
f2a0f97f7e [CostModel][X86] Improve vector mul overflow intrinsic costs 2024-07-26 14:34:13 +01:00
Simon Pilgrim
010dcfd85f [CostModel][X86] Improve add/sub/mul overflow intrinsic costs
Noticed due to x86 changes in #97463
2024-07-25 16:01:05 +01:00
Simon Pilgrim
b2b68c241a [CostModel][X86] Add add/sub sat intrinsic costs
Fixes regressions from #97463 due to missing costs for custom lowered ops
2024-07-25 14:35:39 +01:00
Tianqing Wang
3d494bfc7f
[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. (#98495)
The `!unpredictable` metadata has been present for a long time, but
its use in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardware and ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
2024-07-23 07:47:21 +08:00
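
A conceptual illustration (hypothetical source code, not the pass itself) of what FoldTwoEntryPHINode does; marking the branch `!unpredictable` raises the budget for speculating both arms:

    // Before: an if/else diamond feeding a two-entry PHI, with a branch that is
    // hard to predict.
    int before(int X, int A, int B) {
      int R;
      if (X > 0)
        R = A * 3 + 1;
      else
        R = B * 5 - 2;
      return R; // PHI with two incoming values
    }

    // After folding: both arms are computed and merged with a select, trading a
    // few speculated instructions for the removal of a likely misprediction.
    int after(int X, int A, int B) {
      int T = A * 3 + 1;
      int F = B * 5 - 2;
      return X > 0 ? T : F;
    }
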
Shengchen Kan
a48305e0f9 [X86][CodeGen] Convert masked.load/store to CLOAD/CSTORE node only when vector size = 1
This fixes the crash when building llvm-test-suite with avx512f + cf.
2024-07-05 15:50:21 +08:00
Shengchen Kan
15fc801cf0
[X86][CodeGen] Support hoisting load/store with conditional faulting (#96720)
1. Add TTI interface for conditional load/store.
2. Mark 1 x i16/i32/i64 masked load/store legal so that it's not
   legalized in the scalarize-masked-mem-intrin pass.
3. Visit 1 x i16/i32/i64 masked load/store to build a target-specific
   CLOAD/CSTORE node to avoid an error in
   `DAGTypeLegalizer::ScalarizeVectorResult`.
4. Combine DAG to simplify the nodes for CLOAD/CSTORE.
5. Lower CLOAD/CSTORE to CFCMOV by pattern match.

This is CodeGen part of #95515
2024-06-27 17:01:55 +08:00
Simon Pilgrim
a46a2c2b7d
[X86] Lower vXi8 multiplies using PMADDUBSW on SSSE3+ targets (#95690)
Extends https://github.com/llvm/llvm-project/pull/95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention (but a lower instruction count).

Fixes https://github.com/llvm/llvm-project/issues/90748
2024-06-25 12:25:56 +01:00
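
An SSSE3 intrinsics sketch (illustrative only, not the LLVM lowering) of the trick: PMADDUBSW multiplies unsigned bytes of one operand by signed bytes of the other and sums adjacent pairs, so masking every other byte of one operand isolates a single product per 16-bit lane and avoids the unpack/extend to vXi16:

    #include <immintrin.h>

    // Low 8 bits of an element-wise v16i8 multiply, without unpacking to i16.
    __m128i mul_v16i8(__m128i A, __m128i B) {
      const __m128i LowByteMask = _mm_set1_epi16(0x00FF);
      // Products of even-indexed bytes, one per 16-bit lane.
      __m128i Even = _mm_maddubs_epi16(A, _mm_and_si128(B, LowByteMask));
      // Products of odd-indexed bytes, one per 16-bit lane.
      __m128i Odd = _mm_maddubs_epi16(A, _mm_andnot_si128(LowByteMask, B));
      // Keep the low byte of each product and interleave them back together.
      Even = _mm_and_si128(Even, LowByteMask);
      Odd = _mm_slli_epi16(Odd, 8);
      return _mm_or_si128(Even, Odd);
    }
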
Nikita Popov
f2f18459d4 Revert "Intrinsic: introduce minimumnum and maximumnum (#93841)"
As far as I can tell, this pull request was not approved, and
did not go through an RFC on discourse.

This reverts commit 89881480030f48f83af668175b70a9798edca2fb.
This reverts commit 225d8fc8eb24fb797154c1ef6dcbe5ba033142da.
2024-06-21 08:34:04 +02:00
YunQiang Su
8988148003
Intrinsic: introduce minimumnum and maximumnum (#93841)
Currently, the behavior of llvm.minnum differs across platforms if one
operand is sNaN.

When we compare sNaN vs NUM:

ARM/AArch64/PowerPC: follow IEEE754-2008's minNum: return qNaN.
RISC-V/Hexagon: follow IEEE754-2019's minimumNumber: return NUM.
X86: returns NUM, but does not match IEEE754-2019's minimumNumber, as
     +0.0 is not always treated as greater than -0.0.
MIPS/LoongArch/Generic: return NUM.
LIBCALL: returns qNaN.

So, let's introduce llvm.minimumnum/llvm.maximumnum, which always follow
IEEE754-2019's minimumNumber/maximumNumber.

Half-fix: #93033
2024-06-21 11:53:08 +08:00
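
A reference sketch (illustrative C++ ignoring exception flags; not the LLVM implementation) of the IEEE754-2019 minimumNumber semantics the new intrinsics follow:

    #include <cmath>

    double minimumNumber(double A, double B) {
      if (std::isnan(A))
        return std::isnan(B) ? A + B /* quiet NaN */ : B; // a NaN loses to a number
      if (std::isnan(B))
        return A;
      if (A == B)                       // +0.0 and -0.0 compare equal...
        return std::signbit(A) ? A : B; // ...but -0.0 is the minimum
      return A < B ? A : B;
    }
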
Simon Pilgrim
fe3f8ad8cc [X86] getIntrinsicInstrCost - begin generalizing BSWAP load/store-folding handling.
Move load/store folding 'free costs' inside the adjustTableCost helper so we can handle some additional intrinsics in the future.

The plan is to do something similar for other cost callbacks as well (getArithmeticInstrCost etc.).
2024-06-17 18:01:12 +01:00
Simon Pilgrim
22530e7985 [CostModel][X86] Update vXi8 mul costs for AVX512BW/AVX2/AVX1/SSE
Later levels were inheriting some of the worst case costs from SSE/AVX1 etc.

Based off llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts

Cleanup prep work for #90748
2024-06-16 07:27:35 +01:00
Simon Pilgrim
995ba4afcd [CostModel][X86] Adjust ABS scalar SizeLatency cost to 3uops
This was previously set to 4uops, which included the cost of extra register moves in the original test code.
2024-06-11 10:29:18 +01:00
Shengchen Kan
e282118f47 [X86][TTI] Update the return value of X86TTIImpl::getNumberOfRegisters for EGPR 2024-06-05 13:45:01 +08:00
Jan Patrick Lehr
3082258d3a
[CodeGen][X86] Use TargetLowering for TypeInfo of PointerTy (#93469)
This uses the TargetLowering getSimpleValueType mechanism to retrieve
the ValueType info inside the X86 cost model.

This resolves a build issue we were seeing for the miniQMC application after
https://github.com/llvm/llvm-project/pull/92671.
2024-05-29 14:42:48 +02:00
Simon Pilgrim
e8877b29a3 [X86] getGatherScatterOpCost - remove unnecessary extra brackets. NFC. 2024-05-28 10:03:04 +01:00
Simon Pilgrim
1a4b113a41 [CostModel][X86] getCastInstrCost - update cost tables to support CostKinds
Add TypeConversionCostKindTblEntry to hold the cost kinds and update the cast tables to take the existing default codesize/latency/sizelatency values (I'll update these values in future commits).

I've moved AdjustCost to the end of the function to ensure we don't accidentally use it, apart from when we fallback to default cost calculations.
2024-05-13 13:44:32 +01:00
Simon Pilgrim
502e77df1f [CostModel][X86] getGSVectorCost - remove FIXME
getGSVectorCost has supported other TargetCostKind since a55127281b2ed5f24f848b9e5c70870ad170bc3f
2024-05-12 15:24:04 +01:00
Simon Pilgrim
a477004d82 [CostModel][X86] Remove getGSScalarCost and use getCommonMaskedMemoryOpCost directly
The generic getCommonMaskedMemoryOpCost now gives the same cost estimates for scalarized gather/scatter.
2024-05-12 15:21:08 +01:00
Simon Pilgrim
23fe1fc6b7 [TTI][X86] getGSScalarCost - don't bother with adding cost of ICMP for each i1 mask element
These can nearly always be folded into the existing cost of the branch, which brings the throughput costs of the scalarised gather/scatter code much closer to the llvm-mca/uica estimates.
2024-05-11 13:37:48 +01:00
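
The rough shape of the scalarised gather being costed (illustrative code, not LLVM's actual expansion) - each element tests its i1 mask bit and branches, and the compare of the mask bit folds into that branch, so charging a separate ICMP per element double-counts:

    // Per-element form of a masked gather after scalarisation.
    void scalarized_gather(float *Result, float *const *Ptrs, const bool *Mask,
                           unsigned VF) {
      for (unsigned I = 0; I != VF; ++I)
        if (Mask[I])              // test + branch; no separate compare cost
          Result[I] = *Ptrs[I];   // conditional scalar load
    }
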
Graham Hunter
2e8d815596
[TTI] Support scalable offsets in getScalingFactorCost (#88113)
Part of the work to support vscale-relative immediates in LSR.
2024-05-10 11:22:11 +01:00
Simon Pilgrim
a55127281b [CostModel][X86] getGSVectorCost - add cost kind support
Don't just assume gather/scatter non-throughput costs are 1 - latency and sizelatency (#uops) costs will be high, and codesize (#instructions) needs to account for splitting.
2024-05-08 16:24:39 +01:00
Simon Pilgrim
abd314938d
[X86] Use GFNI for vXi8 shifts/rotates (#89115)
As detailed here: https://github.com/InstLatx64/InstLatX64_Demo/blob/master/GFNI_Demo.h

We can use the gf2p8affine instruction to lower byte shifts/rotates as well as the existing bitreverse case.

Based off the original patch here: https://reviews.llvm.org/D137026
2024-05-07 10:28:55 +01:00
Simon Pilgrim
8a0073ad46
[CostModel][X86] Treat lrint/llrint as fptosi calls (#90883)
X86 can use the CVTP2SI instructions to lower lrint/llrint calls, which have the same costs as the CVTTP2SI (fptosi) instructions

Followup to #90065
2024-05-03 18:06:50 +01:00
Simon Pilgrim
fc7e74e879 [CostModel][X86] getCastInstrCost - improve CostKind adjustment when splitting src/dst types
Noticed in #90883 review - for non-Throughput costs, we weren't applying the split count to the '0 or 1' cost value.

This still doesn't work well as many of the type legalizations are hidden, so we don't have the split count; really we need to move to a CostKindCosts-based cost table, but that's going to be a lot of work :/
2024-05-03 12:17:18 +01:00
Simon Pilgrim
f89f670d92 [CostModel][X86] Broadcast shuffles can be free if they are from a one-use load
AVX1+ can handle 32/64-bit broadcast loads, AVX2+ can handle all broadcast loads (we should be able to improve isLegalBroadcastLoad to handle more of this type matching).
2024-04-23 11:11:15 +01:00
Nikita Popov
8521550896 [X86] Use m_APIntAllowPoison instead of m_APIntAllowUndef
Fix build after 1baa3850656382d1d549a13f8a716ef5dc886eb8.
2024-04-18 15:56:03 +09:00
Simon Pilgrim
c02ed29ec1 [CostModel][X86] Recognise vector rotation by uniform constant patterns
Adds suitable costs for AVX512 targets (we still rely on default expansion for AVX2 and earlier)
2024-04-17 19:08:36 +01:00
Simon Pilgrim
e49043512d [CostModel][X86] Update BITREVERSE costs for GFNI targets
Inspired by the recent patches by @shamithoke - we have real scheduler model numbers for GFNI instructions now, allowing us to calculate an upper-bound cost table instead of deriving it analytically.
2024-04-17 15:46:38 +01:00
David Green
4ac2721e51
[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)
This tries to add some costs for the shuffle in an ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added; LD2 should be a
zip1 shuffle, which will be added in another patch.

It should help fix some of the regressions from #87510.
2024-04-09 16:36:08 +01:00
Simon Pilgrim
f09f9bc0aa [X86] Add TODO to remove getGSScalarCost and use BaseT / getCommonMaskedMemoryOpCost directly
There are only a few differences in the use of AddressSpace and getScalarizationOverhead that need to be handled.
2024-04-05 13:16:27 +01:00
Simon Pilgrim
1b4c37fec2 [TTI][X86] getGSVectorCost/getGSScalarCost - add CostKind to the function arguments.
Initial refactor - only getGSScalarCost can actually use CostKind so far, and currently both are only ever set to TCK_RecipThroughput.
2024-04-05 11:15:46 +01:00
Simon Pilgrim
ed41249498 [CostModel][X86] Update AVX1 sext v4i1 -> v4i64 cost based off worst case llvm-mca numbers
We were using raw instruction count which overestimated the costs for #67803
2024-04-04 17:17:55 +01:00
Simon Pilgrim
3871eaba6b [CostModel][X86] Update AVX1 sext v8i1 -> v8i32 cost based off worst case llvm-mca numbers
We were using raw instruction count which overestimated the costs for #67803
2024-04-04 12:26:35 +01:00
Il-Capitano
308ed0233a
[Intrinsics] Make patchpoint.i64 generic on its return type (#85911)
Currently patchpoints can only have two result types, `void` and `i64`.
This limits the result to general purpose registers.
This patch makes `patchpoint.i64` an overloadable intrinsic, allowing
result values that can fit in a single register (e.g. integers,
pointers, floats).
2024-03-26 19:08:52 +05:30
Simon Pilgrim
ee5e027cc6 [X86] getShuffleCost - recognise concat_vector(X,Y) shuffle as InsertSubvector instead of PermuteTwoSrc
We don't have a concat_vector shuffle kind and improveShuffleKindFromMask won't alter the base type to match it as InsertSubvector.

But since this is how X86 will lower concat_vector anyhow, just recognise it explicitly.

Another step for #67803
2024-03-21 09:29:39 +00:00
Kolya Panchenko
889d99a50f
[TTI] Add alignment argument to TTI for compress/expand support (#83516)
Since `llvm.compressstore` and `llvm.expandload` do require memory
access, it's essential for some targets to check whether the alignment is
suitable in order to lower them to target-specific instructions.
2024-03-05 20:33:56 -05:00
Nikita Popov
e84182af91
[X86][Inline] Skip inline asm in inlining target feature check (#83820)
When inlining across functions with different target features, we
perform roughly two checks:
1. The caller features must be a superset of the callee features.
2. Calls in the callee cannot use types where the target features would
   change the call ABI (e.g. by changing whether something is passed in a
   zmm or two ymm registers). The latter check is very crude right now.

The latter check currently also catches inline asm "calls". I believe
that inline asm should be excluded from this check, as it is independent
of the usual call ABI and is instead governed by the inline asm
constraint string.

Fixes https://github.com/llvm/llvm-project/issues/67054.
2024-03-05 14:21:33 +01:00
Simon Pilgrim
9978f6a10f [CostModel][X86] Reduce the extra costs for ICMP complex predicates when an operand is constant
In most cases, SETCC lowering will be able to simplify/commute the comparison by adjusting the constant.

TODO: We still need to adjust ExtraCost based on CostKind

Fixes #80122
2024-02-21 16:19:39 +00:00
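
Illustrative examples (not from the patch) of the simplifications SETCC lowering can make when one operand is constant, which is why the extra cost for these predicates is reduced:

    #include <immintrin.h>

    // Vector integer compares on x86 natively provide only "equal" and "signed
    // greater than" (PCMPEQ/PCMPGT). With a constant operand, other predicates
    // can usually be handled by adjusting the constant rather than emitting
    // extra instructions:
    //   x >= 16 (sge)  ->  x > 15   (single PCMPGT)
    //   x <= 7  (sle)  ->  8 > x    (single PCMPGT, operands swapped)
    __m128i sge16(__m128i X) {
      return _mm_cmpgt_epi32(X, _mm_set1_epi32(15));
    }
    __m128i sle7(__m128i X) {
      return _mm_cmpgt_epi32(_mm_set1_epi32(8), X);
    }
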