The `!unpredictable` metadata has been present for a long time, but
its usage in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.
A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardware, ensuring we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
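As an illustration, the kind of source pattern this affects (a hedged sketch; the function below is invented for this example, not taken from the patch):

```cpp
// A two-entry PHI: 'r' merges values from the then- and else-paths. When
// the branch is marked !unpredictable, folding the PHI into a select
// (x86: CMOV) trades a likely misprediction for a couple of always-
// executed instructions.
int clamp_to_limit(int x, int limit) {
  int r;
  if (x > limit) // assume profiling marks this branch unpredictable
    r = limit;
  else
    r = x;
  return r; // after the fold: r = (x > limit) ? limit : x
}
```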
1. Add a TTI interface for conditional load/store.
2. Mark 1 x i16/i32/i64 masked load/store as legal so that they are not
legalized by the scalarize-masked-mem-intrin pass.
3. Visit 1 x i16/i32/i64 masked load/store to build a target-specific
CLOAD/CSTORE node, avoiding an error in
`DAGTypeLegalizer::ScalarizeVectorResult`.
4. Combine the DAG to simplify CLOAD/CSTORE nodes.
5. Lower CLOAD/CSTORE to CFCMOV by pattern matching (see the sketch
below).
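Conceptually (a hedged sketch; this function is illustrative and not taken from the patch), the single-element masked-load pattern that can now survive to CFCMOV looks like:

```cpp
// A 1 x i64 conditional load: previously the masked-load intrinsic was
// scalarized into a branch; with APX, CFCMOV can predicate the load
// itself (conditional faulting), so no branch is needed.
long load_or_default(bool cond, const long *p, long fallback) {
  return cond ? *p : fallback;
}
```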
This is CodeGen part of #95515
Extends https://github.com/llvm/llvm-project/pull/95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back down to vXi8.
Most targets benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention (but a lower instruction count).
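A hedged sketch of the idea with SSSE3 intrinsics (illustrative only; `mul_v16i8_pmaddubsw` is not the patch's code):

```cpp
#include <immintrin.h>

// Multiply vXi8 without widening to vXi16: PMADDUBSW with the odd/even
// bytes of one operand zeroed yields per-pair products in i16 lanes (a
// single u8*i8 product cannot saturate), and the low product bytes are
// then recombined.
static __m128i mul_v16i8_pmaddubsw(__m128i a, __m128i b) {
  const __m128i LoMask = _mm_set1_epi16(0x00FF);
  // Even-lane products: zero the odd bytes of b.
  __m128i Even = _mm_maddubs_epi16(a, _mm_and_si128(b, LoMask));
  // Odd-lane products: zero the even bytes of b.
  __m128i Odd = _mm_maddubs_epi16(a, _mm_andnot_si128(LoMask, b));
  // Keep the low byte of each product and interleave the results.
  return _mm_or_si128(_mm_and_si128(Even, LoMask), _mm_slli_epi16(Odd, 8));
}
```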
Fixes https://github.com/llvm/llvm-project/issues/90748
As far as I can tell, this pull request was not approved and did not
go through an RFC on Discourse.
This reverts commit 89881480030f48f83af668175b70a9798edca2fb.
This reverts commit 225d8fc8eb24fb797154c1ef6dcbe5ba033142da.
Currently, the behavior of llvm.minnum differs across platforms when
one operand is sNaN. When we compare sNaN vs NUM:
* ARM/AArch64/PowerPC: follow IEEE 754-2008's minNum and return qNaN.
* RISC-V/Hexagon: follow IEEE 754-2019's minimumNumber and return NUM.
* X86: returns NUM, but does not fully match IEEE 754-2019's
minimumNumber, as +0.0 is not always treated as greater than -0.0.
* MIPS/LoongArch/Generic: return NUM.
* LIBCALL: returns qNaN.
So, let's introduce llvm.minimumnum/llvm.maximumnum, which always
follow IEEE 754-2019's minimumNumber/maximumNumber.
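For reference, a hedged C++ sketch of IEEE 754-2019 minimumNumber semantics (illustrative; not part of the patch):

```cpp
#include <cmath>
#include <limits>

// minimumNumber: any NaN (even signaling) is treated as missing data,
// and -0.0 compares less than +0.0.
double minimum_number(double a, double b) {
  if (std::isnan(a))
    return std::isnan(b) ? std::numeric_limits<double>::quiet_NaN() : b;
  if (std::isnan(b))
    return a;
  if (a == b) // distinguish -0.0 from +0.0
    return std::signbit(a) ? a : b;
  return a < b ? a : b;
}
```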
Partially fixes: #93033
Move load/store folding 'free costs' inside the adjustTableCost helper so we can add additional intrinsics in the future.
The plan is to do something similar for other cost callbacks as well (getArithmeticInstrCost etc.).
Later levels were inheriting some of the worst-case costs from SSE/AVX1 etc.
Based on llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts
Cleanup prep work for #90748
This uses the TargetLowering getSimpleValueType mechanism to retrieve
the ValueType info inside the X86 cost model.
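Roughly (a hedged sketch of the query; the helper name and shape are assumptions, not the patch's exact code):

```cpp
#include "llvm/CodeGen/TargetLowering.h"
#include <utility>

// Ask TargetLowering for the MVT of an IR type; only simple
// (non-extended) value types have one.
static std::pair<bool, llvm::MVT>
getSimpleVT(const llvm::TargetLowering &TLI, const llvm::DataLayout &DL,
            llvm::Type *Ty) {
  llvm::EVT VT = TLI.getValueType(DL, Ty, /*AllowUnknown=*/true);
  if (!VT.isSimple())
    return {false, llvm::MVT()};
  return {true, VT.getSimpleVT()};
}
```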
This resolves a build issue we were seeing for the miniQMC application after
https://github.com/llvm/llvm-project/pull/92671.
Add TypeConversionCostKindTblEntry to hold the cost kinds and update the cast tables to take the existing default codesize/latency/sizelatency values (I'll update these values in future commits).
I've moved AdjustCost to the end of the function to ensure we don't accidentally use it, apart from when we fall back to default cost calculations.
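The entry shape is roughly (a hedged sketch; field names are assumptions based on the description, not the actual declaration):

```cpp
// One cost per cost kind, instead of a single throughput-only value.
struct CostKindCosts {
  unsigned RecipThroughput;
  unsigned Latency;
  unsigned CodeSize;
  unsigned SizeAndLatency;
};

// Cast-table entry: an ISD cast opcode plus destination/source types.
struct TypeConversionCostKindTblEntry {
  int ISD;      // e.g. ISD::SIGN_EXTEND
  int Dst, Src; // MVT::SimpleValueType values
  CostKindCosts Cost;
};
```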
These can nearly always be folded into the existing cost of the branch, and this brings the throughput costs of the scalarised gather/scatter code much closer to the llvm-mca/uica estimates.
Don't just assume gather/scatter non-throughput costs are 1 - latency and sizelatency (#uops) costs will be high, and codesize (#instructions) needs to account for splitting.
Noticed in the #90883 review - for non-throughput costs, we weren't applying the split count to the '0 or 1' cost value.
This still doesn't work well, as many of the type legalizations are hidden and we don't have the split count; really we need to move to a CostKindCosts-based cost table, but that's going to be a lot of work :/
AVX1+ can handle 32/64-bit broadcast loads, AVX2+ can handle all broadcast loads (we should be able to improve isLegalBroadcastLoad to handle more of this type matching).
Inspired by the recent patches by @shamithoke - we now have real scheduler model numbers for GFNI instructions, allowing us to calculate an upper-bound cost table instead of deriving the costs analytically.
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which is represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, we need to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added too; LD2 should
be a zip1 shuffle, which will be added in another patch.
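For illustration, the ST3 pattern whose shuffle this costs (a hedged sketch using NEON intrinsics, not code from the patch):

```cpp
#include <arm_neon.h>

// vst3q_u32 stores a, b, c interleaved (a0 b0 c0 a1 b1 c1 ...). In LLVM
// IR this becomes one wide store fed by an interleaving shufflevector -
// the shuffle the cost model now recognises through its store user.
void store_interleaved3(uint32_t *p, uint32x4_t a, uint32x4_t b,
                        uint32x4_t c) {
  uint32x4x3_t v = {{a, b, c}};
  vst3q_u32(p, v);
}
```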
It should help fix some of the regressions from #87510.
Currently patchpoints can only have two result types, `void` and `i64`.
This limits the result to general purpose registers.
This patch makes `patchpoint.i64` an overloadable intrinsic, allowing
result values that can fit in a single register (e.g. integers,
pointers, floats).
We don't have a concat_vector shuffle kind, and improveShuffleKindFromMask won't alter the base type to match it as InsertSubvector.
But since this is how X86 will lower concat_vector anyhow, just recognise it explicitly.
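For example (a hedged sketch using the clang vector extension; illustrative only):

```cpp
typedef int v4i __attribute__((vector_size(16)));
typedef int v8i __attribute__((vector_size(32)));

// Concatenation expressed as a shuffle: the identity mask <0..7> over
// two v4i sources. With no dedicated concat shuffle kind, the cost model
// now recognises this mask pattern explicitly.
v8i concat(v4i a, v4i b) {
  return __builtin_shufflevector(a, b, 0, 1, 2, 3, 4, 5, 6, 7);
}
```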
Another step for #67803
Since `llvm.compressstore` and `llvm.expandload` require memory
access, it's essential for some targets to be able to check whether the
alignment is sufficient to lower them to target-specific instructions.
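A hedged sketch of how a caller might use this (the helper is illustrative and assumes the TTI hooks now take an alignment argument, per the description above):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"

// Ask the target whether a compressstore of this type at this alignment
// can be lowered natively rather than scalarized.
static bool canLowerCompressStore(const llvm::TargetTransformInfo &TTI,
                                  llvm::Type *DataTy, llvm::Align A) {
  return TTI.isLegalMaskedCompressStore(DataTy, A);
}
```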
When inlining across functions with different target features, we
perform roughly two checks:
1. The caller's features must be a superset of the callee's features.
2. Calls in the callee cannot use types where the target features would
change the call ABI (e.g. by changing whether something is passed in a
zmm register or in two ymm registers).

The latter check is very crude right now and currently also catches
inline asm "calls". I believe that inline asm should be excluded from
this check, as it is independent of the usual call ABI and instead
governed by the inline asm constraint string.
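A hedged sketch of the kind of code this affects (invented for illustration, not a test from the patch):

```cpp
typedef double v8df __attribute__((vector_size(64)));

// The inline asm consumes a 512-bit value, but how it is passed is fixed
// by the constraint string ("v" = an EVEX-encodable vector register),
// not by the call ABI - so it should not block inlining into a caller
// with different target features.
__attribute__((target("avx512f")))
static inline double first_lane(v8df x) {
  asm("" : "+v"(x)); // opaque use of x in a vector register
  return x[0];
}
```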
Fixes https://github.com/llvm/llvm-project/issues/67054.
In most cases, SETCC lowering will be able to simplify/commute the comparison by adjusting the constant.
TODO: We still need to adjust ExtraCost based on CostKind
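For example (an illustrative sketch, not from the patch):

```cpp
// (x > 7) and (x >= 8) are the same predicate over integers, so SETCC
// lowering can adjust the constant to commute/simplify the comparison
// into whichever form the target handles more cheaply.
bool above_limit(int x) {
  return x > 7; // may be lowered as x >= 8 (SETGT C -> SETGE C+1)
}
```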
Fixes #80122
Fall back to a single-source permute if there is no direct cost estimation
for extract subvector.
Many targets do not have a cost for the extractsubvector shuffle kind,
but do have costs for single-source permutes. If there is no cost
estimation for extractsubvector, it is better to switch to a
single-source permute for a better cost estimate.
Reviewers: RKSimon, davemgreen, arsenm
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/79837
SLP/TTI do not know about the cost estimation for the addsub pattern
supported by X86. Previously, support for detecting the pattern was
added (see TTI::isLegalAltInstr), but the cost was still not estimated
properly.
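The pattern in question (a hedged sketch; the function is illustrative):

```cpp
// Alternating subtract/add across lanes: SLP vectorizes this into an
// alternate-opcode node that X86 can lower to (v)addsubps, and the cost
// model can now price it as such.
void addsub(float *r, const float *a, const float *b) {
  r[0] = a[0] - b[0];
  r[1] = a[1] + b[1];
  r[2] = a[2] - b[2];
  r[3] = a[3] + b[3];
}
```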
It seems TypeSize is currently broken in the sense that:

  TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)

without failing its assert that explicitly tests for this case:

  assert(LHS.Scalable == RHS.Scalable && ...);

The reason this fails is that `Scalable` is a static method of class
TypeSize, and LHS and RHS are both objects of class TypeSize. So the
assert is evaluating whether the pointer to the function `Scalable`
equals the pointer to the function `Scalable`, which is always true
because LHS and RHS have the same class.
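A distilled repro of the pitfall (a hedged sketch with simplified names, not LLVM's actual classes):

```cpp
#include <cassert>

struct Quantity {
  bool Scalable = false; // the flag the assert was meant to compare
  unsigned Value = 0;
};

struct Size : Quantity {
  static Size Fixed(unsigned V) { Size S; S.Value = V; return S; }
  static Size Scalable(unsigned V) { // hides the inherited 'Scalable' flag
    Size S;
    S.Quantity::Scalable = true; // must qualify to reach the member
    S.Value = V;
    return S;
  }
};

Size operator+(Size L, Size R) {
  // Compares the static member function with itself - always true.
  assert(L.Scalable == R.Scalable && "mixed fixed/scalable");
  L.Value += R.Value;
  return L;
}
```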
This patch fixes the issue by renaming `TypeSize::Scalable` to
`TypeSize::getScalable` and `TypeSize::Fixed` to `TypeSize::getFixed`,
so that the methods no longer clash with the `Scalable` member variable
in FixedOrScalableQuantity. The new names also better match the coding
standard, which specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
Need to add a NumSrcElts param to the is..Mask functions in the
ShuffleVectorInst class for better mask analysis: Mask.size() does not
always match the size of the permuted vector(s). This allows better
cost estimation in SLP and fixes uses of the functions in other cases.
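An example where the mask size and the source element count differ (a hedged sketch using the clang `__builtin_shufflevector` extension):

```cpp
typedef int v4i __attribute__((vector_size(16)));
typedef int v2i __attribute__((vector_size(8)));

// Mask.size() == 2, but each source has 4 elements, so mask analysis
// needs NumSrcElts passed in rather than inferred from the mask length.
v2i pick_lanes(v4i a, v4i b) {
  return __builtin_shufflevector(a, b, 0, 5); // a[0], b[1]
}
```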
Differential Revision: https://reviews.llvm.org/D158449
Add initial half/bfloat broadcast shuffle test coverage (more to follow).
Fixes #68117 - which was stuck in a loop between getting scalarized insert/extract costs for the shuffle and then trying to convert a bfloat insert into a shuffle again.