llvm-project

Author	SHA1	Message	Date
Shih-Po Hung	b9bdec3021	[TTI][Vectorize] Migrate masked/gather-scatter/strided/expand-compress costing (NFCI) (#165532 ) In #160470, there is a discussion about the possibility to explored a general approach for handling memory intrinsics. API changes: - Remove getMaskedMemoryOpCost, getGatherScatterOpCost, getExpandCompressMemoryOpCost, getStridedMemoryOpCost from Analysis/TargetTransformInfo. - Add getMemIntrinsicInstrCost. In BasicTTIImpl, map intrinsic IDs to existing target implementation until the legacy TTI hooks are retired. - masked_load/store → getMaskedMemoryOpCost - masked_/vp_gather/scatter → getGatherScatterOpCost - masked_expandload/compressstore → getExpandCompressMemoryOpCost - experimental_vp_strided_{load,store} → getStridedMemoryOpCost TODO: add support for vp_load_ff. No functional change intended; costs continue to route to the same target-specific hooks.	2025-11-28 05:14:37 +00:00
Shih-Po Hung	f40c694713	[TTI] Use MemIntrinsicCostAttributes for getExpandCompressMemoryOpCost (#168677 ) - Following #168029. This is a step toward a unified interface for masked/gather-scatter/strided/expand-compress cost modeling. - Replace the ad-hoc parameter list with a single attributes object. API change: ``` - InstructionCost getExpandCompressMemoryOpCost(Opcode, DataTy, - VariableMask, Alignment, - CostKind, Inst); + InstructionCost getExpandCompressMemoryOpCost(MemIntrinsicCostAttributes, + CostKind); ``` Notes: - NFCI intended: callers populate MemIntrinsicCostAttributes with same information as before.	2025-11-28 04:19:23 +00:00
Drew Kersnar	17852deda7	[NVPTX] Lower LLVM masked vector loads and stores to PTX (#159387 ) This backend support will allow the LoadStoreVectorizer, in certain cases, to fill in gaps when creating load/store vectors and generate LLVM masked load/stores (https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To accomplish this, changes are separated into two parts. This first part has the backend lowering and TTI changes, and a follow up PR will have the LSV generate these intrinsics: https://github.com/llvm/llvm-project/pull/159388. In this backend change, Masked Loads get lowered to PTX with `#pragma "used_bytes_mask" [mask];` (https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask). And Masked Stores get lowered to PTX using the new sink symbol syntax (https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st). # TTI Changes TTI changes are needed because NVPTX only supports masked loads/stores with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to check that the mask is constant and pass that result into the TTI check. Behavior shouldn't change for non-NVPTX targets, which do not care whether the mask is variable or constant when determining legality, but all TTI files that implement these API need to be updated. # Masked store lowering implementation details If the masked stores make it to the NVPTX backend without being scalarized, they are handled by the following: * `NVPTXISelLowering.cpp` - Sets up a custom operation action and handles it in lowerMSTORE. Similar handling to normal store vectors, except we read the mask and place a sentinel register `$noreg` in each position where the mask reads as false. For example, ``` t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1> t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10 -> STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1); ``` * `NVPTXInstInfo.td` - changes the definition of store vectors to allow for a mix of sink symbols and registers. * `NVPXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_". # Masked load lowering implementation details Masked loads are routed to normal PTX loads, with one difference: a `#pragma "used_bytes_mask"` is emitted before the load instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask). To accomplish this, a new operand is added to every NVPTXISD Load type representing this mask. * `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal NVPTXISD loads with a mask operand in two ways. 1) In type legalization through replaceLoadVector, which is the normal path, and 2) through LowerMLOAD, to handle the legal vector types (v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both share the same convertMLOADToLoadWithUsedBytesMask helper. Both default this operand to UINT32_MAX, representing all bytes on. For the latter, we need a new `NVPTXISD::MLoadV1` type to represent that edge case because we cannot put the used bytes mask operand on a generic LoadSDNode. * `NVPTXISelDAGToDAG.cpp` - Extract used bytes mask from loads, add them to created machine instructions. * `NVPTXInstPrinter.h/.cpp` - Print the pragma when the used bytes mask isn't all ones. * `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update manual indexing of load operands to account for new operand. * `NVPTXInsrtInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to the MI definitions. * `NVPTXTagInvariantLoads.cpp` - Ensure that masked loads also get tagged as invariant. Some generic changes that are needed: * `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting masked loads. * `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked load SDNode creation	2025-11-25 10:26:15 -06:00
Phoebe Wang	28eee722aa	[GlobalOpt] Add TTI interface useFastCCForInternalCall for FASTCC (#164768 ) Background: X86 APX feature adds 16 registers within the same 64-bit mode. PR #164638 is trying to extend such registers for FASTCC. However, a blocker issue is calling convention cannot be changeable with or without a feature. The solution is to disable FASTCC if APX is not ready. This is an NFC change to the final code generation, becasue X86 doesn't define an alternative ABI for FASTCC in 64-bit mode. We can solve the potential compatibility issue of #164638 with this patch.	2025-11-24 08:47:52 +08:00
Shih-Po Hung	961940e1a7	[TTI] Use MemIntrinsicCostAttributes for getMaskedMemoryOpCost (#168029 ) - Split from #165532. This is a step toward a unified interface for masked/gather-scatter/strided/expand-compress cost modeling. - Replace the ad-hoc parameter list with a single attributes object. API change: ``` - InstructionCost getMaskedMemoryOpCost(Opcode, Src, Alignment, - AddressSpace, CostKind); + InstructionCost getMaskedMemoryOpCost(MemIntrinsicCostAttributes, + CostKind); ``` Notes: - NFCI intended: callers populate MemIntrinsicCostAttributes with the same information as before. - Follow-up: migrate gather/scatter, strided, and expand/compress cost queries to the same attributes-based entry point.	2025-11-19 09:51:12 +08:00
Kazu Hirata	31b8ba5670	[Analysis, CodeGen] Use ArrayRef instead of const ArrayRef (NFC) (#166026 ) This patch improves readability by using "ArrayRef<T>" instead of "const ArrayRef<T>" and "const ArrayRef<T> &" in function parameter types.	2025-11-01 23:20:19 -07:00
Sam Tebbs	6b19a546aa	[LV] Bundle partial reductions inside VPExpressionRecipe (#147302 ) This PR bundles partial reductions inside the VPExpressionRecipe class. Stacked PRs: 1. https://github.com/llvm/llvm-project/pull/147026 2. https://github.com/llvm/llvm-project/pull/147255 3. https://github.com/llvm/llvm-project/pull/156976 4. https://github.com/llvm/llvm-project/pull/160154 5. -> https://github.com/llvm/llvm-project/pull/147302 6. https://github.com/llvm/llvm-project/pull/162503 7. https://github.com/llvm/llvm-project/pull/147513	2025-10-23 11:18:55 +00:00
ShihPo Hung	2be906b255	Revert "[TTI][RISCV] Add cost modelling for intrinsic vp.load.ff (#160470 )" This reverts commit aa08b1a9963f33ded658d3ee655429e1121b5212.	2025-09-26 17:24:56 -07:00
Shih-Po Hung	aa08b1a996	[TTI][RISCV] Add cost modelling for intrinsic vp.load.ff (#160470 ) Split out from #151300 to isolate TargetTransformInfo cost modelling for fault-only-first loads from VPlan implementation details. This change adds costing support for vp.load.ff independently of the VPlan work. For now, model a vp.load.ff as cost-equivalent to a vp.load.	2025-09-26 16:47:10 +08:00
Hank Chang	f8e51df8e5	[TTI][ASan][RISCV] reland Move InterestingMemoryOperand to Analysis and embed in MemIntrinsicInfo #157863 (#159713 ) [Previously reverted due to failures on asan-rvv-intrinsics.ll, the test case is riscv only and it is triggered by other target] Reland [#157863](https://github.com/llvm/llvm-project/pull/157863), and add `; REQUIRES: riscv-registered-target` in test case to skip the configuration that doesn't register riscv target. Previously asan considers target intrinsics as black boxes, so asan could not instrument accurate check. This patch make SmallVector<InterestingMemoryOperand> a member of MemIntrinsicInfo so that TTI can make targets describe their intrinsic informations to asan. Note, 1. This patch move InterestingMemoryOperand from Transforms to Analysis. 2. Extend MemIntrinsicInfo by adding a SmallVector<InterestingMemoryOperand> member. 3. This patch does not support RVV indexed/segment load/store.	2025-09-23 09:42:58 +08:00
Florian Mayer	48f804d609	Revert "[TTI][ASan][RISCV] Move InterestingMemoryOperand to Analysis and embed in MemIntrinsicInfo" (#159700 ) Reverts llvm/llvm-project#157863	2025-09-18 21:13:04 -07:00
Hank Chang	f8b7f64ad2	[TTI][ASan][RISCV] Move InterestingMemoryOperand to Analysis and embed in MemIntrinsicInfo (#157863 ) Previously asan considers target intrinsics as black boxes, so asan could not instrument accurate check. This patch make SmallVector<InterestingMemoryOperand> a member of MemIntrinsicInfo so that TTI can make targets describe their intrinsic informations to asan. Note, 1. This patch move InterestingMemoryOperand from Transforms to Analysis. 2. Extend MemIntrinsicInfo by adding a SmallVector<InterestingMemoryOperand> member. 3. This patch does not support RVV indexed/segment load/store.	2025-09-19 11:09:41 +08:00
Luke Lau	4bb250d6a3	[VPlan] Always consider register pressure on RISC-V (#156951 ) Stacked on #156923 In https://godbolt.org/z/8svWaredK, we spill a lot on RISC-V because whilst the largest element type is i8, we generate a bunch of pointer vectors for gathers and scatters. This means the VF chosen is quite high e.g. <vscale x 16 x i8>, but we end up using a bunch of <vscale x 16 x i64> m8 registers for the pointers. This was briefly fixed by #132190 where we computed register pressure in VPlan and used it to prune VFs that were likely to spill. The legacy cost model wasn't able to do this pruning because it didn't have visibility into the pointer vectors that were needed for the gathers/scatters. However VF pruning was restricted again to just the case when max bandwidth was enabled in #141736 to avoid an AArch64 regression, and restricted again in #149056 to only prune VFs that had max bandwidth enabled. On RISC-V we take advantage of register grouping for performance and choose a default of LMUL 2, which means there are 16 registers to work with – half the number as SVE, so we encounter higher register pressure more frequently. As such, we likely want to always consider pruning VFs with high register pressure and not just the VFs from max bandwidth. This adds a TTI hook to opt into this behaviour for RISC-V which fixes the motivating godbolt example above. When last checked this significantly reduces the number of spills on SPEC CPU 2017, up to 80% on 538.imagick_r.	2025-09-12 06:21:54 +00:00
Hassnaa Hamdi	35b22764e2	[LV][AArch64] Prefer epilogue with fixed-width over scalable VF. (#155546 ) In case of equal costs Prefer epilogue with fixed-width over scalable VF. That is helpful in cases like post-LTO vectorization where epilogue with fixed-width VF can be removed when we eventually know that the trip count is less than the epilogue iterations.	2025-09-04 19:31:30 +01:00
Sam Tebbs	37127f74f4	[LV] Bundle sub reductions into VPExpressionRecipe (#147255 ) This PR bundles sub reductions into the VPExpressionRecipe class and adjusts the cost functions to take the negation into account. Stacked PRs: 1. https://github.com/llvm/llvm-project/pull/147026 2. -> https://github.com/llvm/llvm-project/pull/147255 3. https://github.com/llvm/llvm-project/pull/147302 4. https://github.com/llvm/llvm-project/pull/147513	2025-09-01 17:25:01 +01:00
Florian Hahn	4e6c88be7c	[TTI] Remove Args argument from getOperandsScalarizationOverhead (NFC). (#154126 ) Remove the ArrayRef<const Value> Args operand from getOperandsScalarizationOverhead and require that the callers de-duplicate arguments and filter constant operands. Removing the Value based Args argument enables callers where no Value * operands are available to use the function in a follow-up: computing the scalarization cost directly for a VPlan recipe. It also allows more accurate cost-estimates in the future: for example, when vectorizing a loop, we could also skip operands that are live-ins, as those also do not require scalarization. PR: https://github.com/llvm/llvm-project/pull/154126	2025-08-20 21:09:08 +01:00
David Sherwood	13d8ba7dea	[LV][TTI] Calculate cost of extracting last index in a scalable vector (#144086 ) There are a couple of places in the loop vectoriser where we want to calculate the cost of extracting the last lane in a vector. However, we wrongly assume that asking for the cost of extracting lane (VF.getKnownMinValue() - 1) is an accurate representation of the cost of extracting the last lane. For SVE at least, this is non-trivial as it requires the use of whilelo and lastb instructions. To solve this problem I have added a new getReverseVectorInstrCost interface where the index is used in reverse from the end of the vector. Suppose a vector has a given ElementCount EC, the extracted/inserted lane would be EC - 1 - Index. For scalable vectors this index is unknown at compile time. I've added a AArch64 hook that better represents the cost, and also a RISCV hook that maintains compatibility with the behaviour prior to this PR. I've also taken the liberty of adding support in vplan for calculating the cost of VPInstruction::ExtractLastElement.	2025-08-19 09:31:37 +01:00
Elvis Wang	01fac67e2a	[TTI] Add cost kind to getAddressComputationCost(). NFC. (#153342 ) This patch add cost kind to `getAddressComputationCost()` for #149955. Note that this patch also remove all the default value in `getAddressComputationCost()`.	2025-08-14 16:01:44 +08:00
Luke Lau	acb86fb9e0	[TTI] Consistently pass the pointer type to getAddressComputationCost. NFCI (#152657 ) In some places we were passing the type of value being accessed, in other cases we were passing the type of the pointer for the access. The most "involved" user is LoopVectorizationCostModel::getMemInstScalarizationCost, which is the only call site that passes in the SCEV, and it passes along the pointer type. This changes call sites to consistently pass the pointer type, and renames the arguments to clarify this. No target actually checks the contents of the type passed, only to see if it's a vector or not, so this shouldn't have an effect.	2025-08-11 18:00:12 +08:00
Nathan Gauër	67273393b1	[VectorCombine][TTI] Prevent extract/ins rewrite to GEP (#150216 ) Using GEP to index into a vector is not disallowed, but not recommended. The SPIR-V backend needs to generate structured access into types, which is impossible with an untyped GEP instruction unless we add more info to the IR. Finding a solution is a work-in-progress, but in the meantime, we'd like to reduce the amount of failures. Preventing this optimizations from rewritting extract/insert instructions into a GEP helps us lower more code to SPIR-V. This change should be OK as it's only active when targeting SPIR-V and disabling a non-recommended transformation. Related to #145002	2025-07-31 14:14:00 +02:00
Alexandros Lamprineas	3ab64c5b29	[NFC][Clang][FMV] Make FMV priority data type future proof. (#150079 ) FMV priority is the returned value of a polymorphic function. On RISC-V and X86 targets a 32-bit value is enough. On AArch64 we currently need 64 bits and we will soon exceed that. APInt seems to be a suitable replacement for uint64_t, presumably with minimal compile time overhead. It allows bit manipulation, comparison and variable bit width.	2025-07-23 10:37:29 +01:00
Florian Hahn	d3d77f71aa	[EarlyCSE,TTI] Don't create new, unused, instructions. (#134534 ) getOrCreateResultFromMemIntrinsic can modify the current function by inserting new instructions without EarlyCSE keeping track of the changes. Introduce a new CanCreate argument, and update the function to only create new instructions when CanCreate = true. Use it when appropriate. Fixes https://github.com/llvm/llvm-project/issues/145183	2025-07-08 21:43:51 +01:00
David Green	77941eba7f	[CostModel] Add a DstTy to getShuffleCost (#141634 ) A shuffle will take two input vectors and a mask, to produce a new vector of size <MaskElts x SrcEltTy>. Historically it has been assumed that the SrcTy and the DstTy are the same for getShuffleCost, with that being relaxed in recent years. If the Tp passed to getShuffleCost is the SrcTy, then the DstTy can be calculated from the Mask elts and the src elt size, but the Mask is not always provided and the Tp is not reliably always the SrcTy. This has led to situations notably in the SLP vectorizer but also in the generic cost routines where assumption about how vectors will be legalized are built into the generic cost routines - for example whether they will widen or promote, with the cost modelling assuming they will widen but the default lowering to promote for integer vectors. This patch attempts to start improving that - it originally tried to alter more of the cost model but that too quickly became too many changes at once, so this patch just plumbs in a DstTy to getShuffleCost so that DstTy and SrcTy can be reliably distinguished. The callers of getShuffleCost have been updated to try and include a DstTy that is more accurate. Otherwise it tries to be fairly non-functional, keeping the SrcTy used as the primary type used in shuffle cost routines, only using DstTy where it was in the past (for InsertSubVector for example). Some asserts have been added that help to check for consistent values when a Mask and a DstTy are provided to getShuffleCost. Some of them took a while to get right, and some non-mask calls might still be incorrect. Hopefully this will provide a useful base to build more shuffles that alter size.	2025-06-21 12:29:29 +01:00
Philip Reames	b96370131d	[TTI] Plumb CostKind through getPartialReductionCost (#144953 ) Purely for the sake of being idiomatic with other TTI costing routines, no direct motivation beyond that.	2025-06-19 15:29:56 -07:00
Florian Hahn	071a6feabd	[TTI] Remove PPC hasActiveVectorLength impl, simplify interface (NFC). (#142310 ) PPCTTIImpl defines hasActiveVectorLength and also getVPMemoryOpCost, but they appear unused (i.e. no changes to tests). Remove them, as they complicate the interface for hasActiveVectorLength. This simplifies the only use in LV as now no placeholder values need to be passed. PR: https://github.com/llvm/llvm-project/pull/142310	2025-06-18 19:02:17 +01:00
Luke Lau	64106581b9	[Constant] Make Constant::getSplatValue return poison on poison (#141870 ) This is a follow up from #141845. TargetTransformInfo::getOperandInfo needs to be updated to check for undef values as otherwise a splat is considered a constant, and some RISC-V cost model tests will start adding a cost to materialize the constant.	2025-05-29 15:01:28 +01:00
Mel Chen	688bccb290	[TTI][LV] Simplify the prototype of preferPredicatedReductionSelect. nfc (#139265 )	2025-05-12 17:24:37 +08:00
David Green	abd2c07e39	[CostModel] Make Op0 and Op1 const in getVectorInstrCost. NFC (#137631 ) This does not alter much at the moment, but allows const pointers to be passed as Op0 and Op1, simplifying later patches	2025-05-01 15:55:08 +01:00
Jonas Paulsson	f5c8c1eedb	[SLPVectorizer] Move X86 specific handling into X86TTIImpl. (#137830 ) `ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars" ` changed SLPVectorizer getScalarizationOverhead() to call TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some cases. This was due to X86 specific handlings in these (overridden) methods, and unfortunately the general preference of TTI.getScalarizationOverhead() was dropped. If VL is available it should always be preferred to use getScalarizationOverhead(), and this is indeed the case for SystemZ which has a special insertion instruction that can insert two GPR64s. Then ` 33af951 "[SLP]Synchronize cost of gather/buildvector nodes with codegen"` reworked SLPVectorizer getGatherCost() which together with ad9909d caused the SystemZ test vec-elt-insertion.ll to fail. This patch restores the SystemZ test and reverts the change in SLPVectorizer getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always called again. The ForPoisonSrc argument is now passed on to the TTI method so that X86 can handle this as required. Fixes: #135346	2025-04-30 17:11:27 +02:00
Sergei Barannikov	bb1765179e	[TTI] Simplify implementation (NFCI) (#136674 ) Replace "concept based polymorphism" with simpler PImpl idiom. This pursues two goals: * Enforce static type checking. Previously, target implementations hid base class methods and type checking was impossible. Now that they override the methods, the compiler will complain on mismatched signatures. * Make the code easier to navigate. Previously, if you asked your favorite LSP server to show a method (e.g. `getInstructionCost()`), it would show you methods from `TTI`, `TTI::Concept`, `TTI::Model`, `TTIImplBase`, and target overrides. Now it is two less :) There are three commits to hopefully simplify the review. The first commit removes `TTI::Model`. This is done by deriving `TargetTransformInfoImplBase` from `TTI::Concept`. This is possible because they implement the same set of interfaces with identical signatures. The first commit makes `TargetTransformImplBase` polymorphic, which means all derived classes should `override` its methods. This is done in second commit to make the first one smaller. It appeared infeasible to extract this into a separate PR because the first commit landed separately would result in tons of `-Woverloaded-virtual` warnings (and break `-Werror` builds). The third commit eliminates `TTI::Concept` by merging it with the only derived class `TargetTransformImplBase`. This commit could be extracted into a separate PR, but it touches the same lines in `TargetTransformInfoImpl.h` (removes `override` added by the second commit and adds `virtual`), so I thought it may make sense to land these two commits together. Pull Request: https://github.com/llvm/llvm-project/pull/136674	2025-04-26 15:25:40 +03:00
David Green	d20604e5b6	[CostModel] Plumb CostKind into getExtractWithExtendCost (#135523 ) This will likely not affect much with the current uses of the function, but if we have getExtractWithExtendCost we can plumb CostKind through it in the same way as other costmodel functions.	2025-04-22 15:09:43 +01:00
Rahul Joshi	99e4b3927c	[LLVM] Cleanup pass initialization for Analysis passes (#135858 ) - Do not call pass initialization from pass constructors. - Instead, pass initialization should happen in the `initializeAnalysis` function. - https://github.com/llvm/llvm-project/issues/111767	2025-04-21 12:36:34 -07:00
Mel Chen	409df9f74c	[TTI][LV] Change the prototype of preferInLoopReduction. nfc (#132698 ) This patch changes the preferInLoopReduction function to take a RecurKind instead of an unsigned Opcode. This makes it possible to distinguish non-arithmetic reductions such as min/max, AnyOf, and FindLastIV, and also helps unify IAnyOf with FAnyOf and IFindLastIV with FFindLastIV. Related patch #118393 #131830	2025-04-07 19:10:16 +08:00
Krzysztof Drewniak	554859c736	[TTI] Make isLegalMasked{Load,Store} take an address space (#134006 ) In order to facilitate targets that only support masked loads/stores on certain address spaces (AMDGPU will support them in an upcoming patch, but only for address space 7), add an AddressSpace parameter to isLegalMaskedLoad and isLegalMaskedStore	2025-04-02 15:38:10 -05:00
pzzp	d6a2cca77e	[llvm:ir] Add support for constant data exceeding 4GiB (#126481 ) The test file is over 4GiB, which is too big, so I didn’t submit it.	2025-03-21 11:44:01 -07:00
Phoebe Wang	64555e3d48	[X86][NFCI] Add IsStore parameter to hasConditionalLoadStoreForType (#132153 ) Address https://github.com/llvm/llvm-project/pull/132032#issuecomment-2736936769	2025-03-20 18:25:09 +08:00
Elvis Wang	6dba5f6595	[TTI] Align optional FMFs in getExtendedReductionCost() to getArithmeticReductionCost(). (#131968 ) In the implementation of the getExtendedReductionCost(), it ofter calls getArithmeticReductionCost() with FMFs. But we shouldn't call getArithmeticReductionCost() with FMFs for non-floating-point reductions which will return the wrong cost. This patch makes FMFs in getExtendedReductionCost() optional and align to the getArithmeticReductionCost(). So the TTI will return the correct cost for non-FP extended-reductions query without FMFs. This patch is not quite NFC but it's hard to test from the CostModel side. Split from #113903.	2025-03-19 18:53:38 +08:00
Alexey Bataev	10085390c6	[SLP]Reduce number of alternate instruction, where possible Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360 It is mostly the same, adjusted after graph-to-tree transformation Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this: ``` %0 = add i32 %1 = sub i32 %2 = add i32 %3 = sub i32 %4 = add i32 %5 = sub i32 %6 = add i32 %7 = sub i32 transformes to %v1 = add <8 x i32> %v2 = sub <8 x i32> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> ``` i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles number of operations. Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this: ``` %v1 = add <4 x i32> %v2 = sub <4 x i32> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> ``` It allows to improve the performance by reducing number of ops. Also, it turns on some other improvements, like improved graph reordering. -O3+LTO, AVX512 Metric: size..text Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5% Olden/tsp - small variations Prolangs-C/TimberWolfMC - small variations, some code not inlined FreeBench/pifft - extra store <8 x double> vectorized, some other extra vectorizations CFP2006/433.milc - better vector code FreeBench/fourinarow - better vector code Benchmarks/tramp3d-v4 - extra vector code, small variations mediabench/gsm/toast - small variations MiBench/telecomm-gsm - small variations CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - better vector code, small variations CINT2006/464.h264ref - some smaller code + changes similar to x264 DOE-ProxyApps-C/miniGMG - small variations Benchmarks/Bullet - small variations CFP2017rate/511.povray_r - small variations DOE-ProxyApps-C/miniAMR - small variations CFP2006/453.povray - small variations DOE-ProxyApps-C++/CLAMR - small variations MiBench/consumer-lame - small variations CFP2006/447.dealII - small variations CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - small variations CFP2017rate/510.parest_r - better vector code, small variations CFP2017rate/526.blender_r - small variations CINT2006/403.gcc - small variations CINT2006/400.perlbench - small variations CFP2017rate/508.namd_r - small variations ASCI_Purple/SMG2000 - small variations JM/lencod - extra store <16 x i32>, small variations DOE-ProxyApps-C++/miniFE - small variations JM/ldecod - extra vector code, small variations, less shuffles CINT2017speed/625.x264_s CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version. -O3+LTO, mcpu=sifive-p470 Metric: size..text results results0 diff test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1% test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7% test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4% CINT2006/464.h264ref - extra v16i32 reduction d/make_dparser - better vector code JM/lencod - extra v16i32 reduction Benchmarks/Bullet - smaller vector code CINT2006/400.perlbench - better vector code CINT2006/403.gcc - small variations CINT2017speed/602.gcc_s CINT2017rate/502.gcc_r - small variations CFP2017rate/510.parest_r - small variations CFP2017rate/526.blender_r - small variations MiBench/consumer-lame - small variations CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - small variations Benchmarks/7zip - small variations CFP2017rate/511.povray_r - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - extra vector code mediabench/gsm - extra vector code MiBench/telecomm-gsm - extra vector code DOE-ProxyApps-C/miniGMG - extra vector code CINT2017rate/525.x264_r CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized CINT2006/464.h264ref - extra vector code in find_sad_16x16 JM/lencod - extra vector code in find_sad_16x16 d/make_dparser - smaller vector code Benchmarks/Bullet - small variations CINT2006/400.perlbench - smaller vector code CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - small variations MiBench/consumer-lame - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - small variations Reviewers: hiraditya Reviewed By: hiraditya Pull Request: https://github.com/llvm/llvm-project/pull/128907	2025-03-12 08:18:51 -07:00
Hans Wennborg	5ec884e5d8	Revert "[SLP]Reduce number of alternate instruction, where possible" This caused assertion failures: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:16237: Value llvm::slpvectorizer::BoUpSLP::vectorizeTree(TreeEntry ): Assertion `OpTE1.isSame( ArrayRef(E->Scalars).take_front(OpTE1.getVectorFactor())) && "Expected same first part of scalars."' failed. See comment on the PR. > Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360 > It is mostly the same, adjusted after graph-to-tree transformation This reverts commit 7de895ff1146c17ec78877900c01c09f4140e692.	2025-03-12 11:16:02 +01:00
Alexey Bataev	7de895ff11	[SLP]Reduce number of alternate instruction, where possible Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360 It is mostly the same, adjusted after graph-to-tree transformation Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this: ``` %0 = add i32 %1 = sub i32 %2 = add i32 %3 = sub i32 %4 = add i32 %5 = sub i32 %6 = add i32 %7 = sub i32 transformes to %v1 = add <8 x i32> %v2 = sub <8 x i32> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> ``` i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles number of operations. Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this: ``` %v1 = add <4 x i32> %v2 = sub <4 x i32> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> ``` It allows to improve the performance by reducing number of ops. Also, it turns on some other improvements, like improved graph reordering. -O3+LTO, AVX512 Metric: size..text Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5% Olden/tsp - small variations Prolangs-C/TimberWolfMC - small variations, some code not inlined FreeBench/pifft - extra store <8 x double> vectorized, some other extra vectorizations CFP2006/433.milc - better vector code FreeBench/fourinarow - better vector code Benchmarks/tramp3d-v4 - extra vector code, small variations mediabench/gsm/toast - small variations MiBench/telecomm-gsm - small variations CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - better vector code, small variations CINT2006/464.h264ref - some smaller code + changes similar to x264 DOE-ProxyApps-C/miniGMG - small variations Benchmarks/Bullet - small variations CFP2017rate/511.povray_r - small variations DOE-ProxyApps-C/miniAMR - small variations CFP2006/453.povray - small variations DOE-ProxyApps-C++/CLAMR - small variations MiBench/consumer-lame - small variations CFP2006/447.dealII - small variations CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - small variations CFP2017rate/510.parest_r - better vector code, small variations CFP2017rate/526.blender_r - small variations CINT2006/403.gcc - small variations CINT2006/400.perlbench - small variations CFP2017rate/508.namd_r - small variations ASCI_Purple/SMG2000 - small variations JM/lencod - extra store <16 x i32>, small variations DOE-ProxyApps-C++/miniFE - small variations JM/ldecod - extra vector code, small variations, less shuffles CINT2017speed/625.x264_s CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version. -O3+LTO, mcpu=sifive-p470 Metric: size..text results results0 diff test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1% test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7% test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4% CINT2006/464.h264ref - extra v16i32 reduction d/make_dparser - better vector code JM/lencod - extra v16i32 reduction Benchmarks/Bullet - smaller vector code CINT2006/400.perlbench - better vector code CINT2006/403.gcc - small variations CINT2017speed/602.gcc_s CINT2017rate/502.gcc_r - small variations CFP2017rate/510.parest_r - small variations CFP2017rate/526.blender_r - small variations MiBench/consumer-lame - small variations CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - small variations Benchmarks/7zip - small variations CFP2017rate/511.povray_r - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - extra vector code mediabench/gsm - extra vector code MiBench/telecomm-gsm - extra vector code DOE-ProxyApps-C/miniGMG - extra vector code CINT2017rate/525.x264_r CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized CINT2006/464.h264ref - extra vector code in find_sad_16x16 JM/lencod - extra vector code in find_sad_16x16 d/make_dparser - smaller vector code Benchmarks/Bullet - small variations CINT2006/400.perlbench - smaller vector code CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - small variations MiBench/consumer-lame - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - small variations Reviewers: hiraditya Reviewed By: hiraditya Pull Request: https://github.com/llvm/llvm-project/pull/128907	2025-03-11 11:40:28 -07:00
Hans Wennborg	e858b10917	Revert "[SLP]Reduce number of alternate instruction, where possible" This caused failures such as: Instruction does not dominate all uses! %29 = insertelement <8 x i64> %28, i64 %xor6.i.5, i64 6 %17 = shufflevector <8 x i64> %29, <8 x i64> poison, <6 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6> see comment on https://github.com/llvm/llvm-project/pull/123360 > Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360 > It is mostly the same, adjusted after graph-to-tree transformation > > Patch tries to remove wide alternate operations. > Currently SLP vectorizer emits something like this: > ``` > %0 = add i32 > %1 = sub i32 > %2 = add i32 > %3 = sub i32 > %4 = add i32 > %5 = sub i32 > %6 = add i32 > %7 = sub i32 > > transformes to > > %v1 = add <8 x i32> > %v2 = sub <8 x i32> > %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> > ``` > i.e. half of the results are just unused. This leads to increased > register pressure and potentially doubles number of operations. > > Patch introduces SplitVectorize mode, where it splits the operations by > opcodes and produces instead something like this: > ``` > %v1 = add <4 x i32> > %v2 = sub <4 x i32> > %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> > ``` > It allows to improve the performance by reducing number of ops. Also, it > turns on some other improvements, like improved graph reordering. > > [...] This reverts commit 9d37e61fc77d3d6de891c30630f1c0227522031d as well as the follow-up commit 72bb0a9a9c6fdde43e1e191f2dc0d5d2d46aff4e.	2025-03-11 15:04:36 +01:00
Alexey Bataev	9d37e61fc7	[SLP]Reduce number of alternate instruction, where possible Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360 It is mostly the same, adjusted after graph-to-tree transformation Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this: ``` %0 = add i32 %1 = sub i32 %2 = add i32 %3 = sub i32 %4 = add i32 %5 = sub i32 %6 = add i32 %7 = sub i32 transformes to %v1 = add <8 x i32> %v2 = sub <8 x i32> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> ``` i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles number of operations. Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this: ``` %v1 = add <4 x i32> %v2 = sub <4 x i32> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> ``` It allows to improve the performance by reducing number of ops. Also, it turns on some other improvements, like improved graph reordering. -O3+LTO, AVX512 Metric: size..text Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5% Olden/tsp - small variations Prolangs-C/TimberWolfMC - small variations, some code not inlined FreeBench/pifft - extra store <8 x double> vectorized, some other extra vectorizations CFP2006/433.milc - better vector code FreeBench/fourinarow - better vector code Benchmarks/tramp3d-v4 - extra vector code, small variations mediabench/gsm/toast - small variations MiBench/telecomm-gsm - small variations CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - better vector code, small variations CINT2006/464.h264ref - some smaller code + changes similar to x264 DOE-ProxyApps-C/miniGMG - small variations Benchmarks/Bullet - small variations CFP2017rate/511.povray_r - small variations DOE-ProxyApps-C/miniAMR - small variations CFP2006/453.povray - small variations DOE-ProxyApps-C++/CLAMR - small variations MiBench/consumer-lame - small variations CFP2006/447.dealII - small variations CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - small variations CFP2017rate/510.parest_r - better vector code, small variations CFP2017rate/526.blender_r - small variations CINT2006/403.gcc - small variations CINT2006/400.perlbench - small variations CFP2017rate/508.namd_r - small variations ASCI_Purple/SMG2000 - small variations JM/lencod - extra store <16 x i32>, small variations DOE-ProxyApps-C++/miniFE - small variations JM/ldecod - extra vector code, small variations, less shuffles CINT2017speed/625.x264_s CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version. -O3+LTO, mcpu=sifive-p470 Metric: size..text results results0 diff test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1% test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7% test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4% CINT2006/464.h264ref - extra v16i32 reduction d/make_dparser - better vector code JM/lencod - extra v16i32 reduction Benchmarks/Bullet - smaller vector code CINT2006/400.perlbench - better vector code CINT2006/403.gcc - small variations CINT2017speed/602.gcc_s CINT2017rate/502.gcc_r - small variations CFP2017rate/510.parest_r - small variations CFP2017rate/526.blender_r - small variations MiBench/consumer-lame - small variations CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - small variations Benchmarks/7zip - small variations CFP2017rate/511.povray_r - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - extra vector code mediabench/gsm - extra vector code MiBench/telecomm-gsm - extra vector code DOE-ProxyApps-C/miniGMG - extra vector code CINT2017rate/525.x264_r CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized CINT2006/464.h264ref - extra vector code in find_sad_16x16 JM/lencod - extra vector code in find_sad_16x16 d/make_dparser - smaller vector code Benchmarks/Bullet - small variations CINT2006/400.perlbench - smaller vector code CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - small variations MiBench/consumer-lame - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - small variations Reviewers: hiraditya Reviewed By: hiraditya Pull Request: https://github.com/llvm/llvm-project/pull/128907	2025-03-10 10:06:39 -04:00
Luke Lau	e1cea0d928	[LV][TTI] Remove unused ReductionFlags. NFC (#129858 ) No in-tree targets currently use it in the preferInLoopReduction/preferPredicatedReductionSelect TTI hooks. It looks like it used to be used in LoopUtils, at least in 8ca60db40bd944dc5f67e0f200a403b4e03818ea, but I presume it was replaced by RecurrenceDescriptor.	2025-03-05 18:31:12 +08:00
Alexander Richardson	17f0aaac57	[TTI] Assert that TargetIRAnalyis is not requested for intrinsics This catches the bug fixed in https://github.com/llvm/llvm-project/pull/127760 and also finds another call in LowerTypeTests where we request the TTI for instrinsics instead of skipping them. Reviewed By: nikic Pull Request: https://github.com/llvm/llvm-project/pull/129600	2025-03-04 08:50:38 -08:00
Benjamin Maxwell	900220d444	[CostModel] Handle vector struct results and cost `llvm.sincos` (#123210 ) This patch updates the cost model to cost intrinsics that return multiple values (in structs) correctly. Previously, the cost model only thought intrinsics that return `VectorType` need scalarizing, which meant it cost intrinsics that return multiple vectors (that need scalarizing) way too cheap (giving it the cost of a single function call). This patch also adds a custom cost for llvm.sincos when a vector function library is available, as certain VFs can be expanded (later in code gen) to a vector function, reducing the cost to a single call (+ the possible loads from the vector function returns values via output pointers).	2025-02-26 11:37:34 +00:00
David Green	b9622e84b4	[TTI][AArch64] Detect OperandInfo from scalable splats. (#122469 ) Pulled out of #122236, this allows Splats constants to be recognized by getOperandInfo, allowing "better" costs for instructions like divides by constants to be produced (which are expanded into mul+add+shift). Some of the costs are not very accurate yet, but the comparison of scalar vs fixed-width vs scalable for the same div can become more accurate, especially with patches like #122236.	2025-02-21 10:41:44 +00:00
Benjamin Maxwell	e0e67a6207	[LV] Add initial support for vectorizing literal struct return values (#109833 ) This patch adds initial support for vectorizing literal struct return values. Currently, this is limited to the case where the struct is homogeneous (all elements have the same type) and not packed. The users of the call also must all be `extractvalue` instructions. The intended use case for this is vectorizing intrinsics such as: ``` declare { float, float } @llvm.sincos.f32(float %x) ``` Mapping them to structure-returning library calls such as: ``` declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>) ``` Or their widened form (such as `@llvm.sincos.v4f32` in this case). Implementing this required two main changes: 1. Supporting widening `extractvalue` 2. Adding support for vectorized struct types in LV * This is mostly limited to parts of the cost model and scalarization Since the supported use case is narrow, the required changes are relatively small.	2025-02-17 09:51:35 +00:00
Sergey Kachkov	8201cf311a	[TTI][CostModel] Add cost modeling for expandload and compressstore intrinsics (#122882 ) This patch adds methods for cost estimation for llvm.masked.expandload/llvm.masked.compressstore intrinsics in TTI. If backend doesn't support custom lowering of these intrinsics it will be processed by ScalarizeMaskedMemIntrin so we estimate its cost via getCommonMaskedMemoryOpCost as gather/scatter operation; for RISC-V backend, this patch implements custom hook to calculate the cost based on current lowering scheme.	2025-02-04 15:08:30 +03:00
Martin Storsjö	d00579be39	Revert "[SLP]Reduce number of alternate instruction, where possible" This reverts commit d5a7a483a65f830a0c7a931781bc90046dc67ff4. That commit triggers failed asserts, see https://github.com/llvm/llvm-project/pull/123360 for details.	2025-02-02 15:56:08 +02:00
Alexey Bataev	d5a7a483a6	[SLP]Reduce number of alternate instruction, where possible Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this: ``` %0 = add i32 %1 = sub i32 %2 = add i32 %3 = sub i32 %4 = add i32 %5 = sub i32 %6 = add i32 %7 = sub i32 transformes to %v1 = add <8 x i32> %v2 = sub <8 x i32> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> ``` i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles number of operations. Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this: ``` %v1 = add <4 x i32> %v2 = sub <4 x i32> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> ``` It allows to improve the performance by reducing number of ops. Also, it turns on some other improvements, like improved graph reordering. -O3+LTO, AVX512 Metric: size..text Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0% test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6% Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not inlined FreeBench/pifft - extra stores <8 x double> vectorized, some other extra vectorizations CINT2006/464.h264ref - some smaller code + changes similar to x264 JM/ldecod - changes similar x264 CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - significantly compact vector code Benchmarks/Bullet - small variations CFP2017rate/526.blender_r - small variations CFP2017rate/510.parest_r - small variations CINT2006/400.perlbench - extra vector code JM/lencod - extra store <16 x i32> and other changes similar x264 Applications/oggenc - extra store <16 x i8>, small variations DOE-ProxyApps-C/miniGMG - small variations Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code CINT2017speed/625.x264_s CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version. -O3+LTO, march=rva32u64 CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac function vectorized, idct4x4dc stopped being vectorized (looks like issue with shuffles cost) CINT2006/400.perlbench - better vector code CINT2006/445.gobmk - some variations in vector code CINT2006/464.h264ref - extra code vectorized CINT2017rate/500.perlbench_r - small variations -O3+LTO, mcpu=sifive-p470 Metric: size..text Program size..text results results0 diff test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0% test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3% CINT2006/464.h264ref - extra vector code in find_sad_16x16 JM/lencod - extra vector code in find_sad_16x16 d/make_dparser - smaller vector code Benchmarks/Bullet - small variations CINT2006/400.perlbench - smaller vector code CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - small variations MiBench/consumer-lame - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - small variations CINT2017rate/525.x264_r CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some TTI costs) Reviewers: RKSimon, hiraditya Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/123360	2025-02-01 10:00:16 -08:00

1 2 3 4 5 ...

557 Commits