llvm-project

Author	SHA1	Message	Date
David Green	4ac2721e51	[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934 ) This tries to add some costs for the shuffle in a ST3/ST4 instruction, which are represented in LLVM IR as store(interleaving shuffle). In order to detect the store, it needs to add a CxtI context instruction to check the users of the shuffle. LD3 and LD4 are added, LD2 should be a zip1 shuffle, which will be added in another patch. It should help fix some of the regressions from #87510.	2024-04-09 16:36:08 +01:00
Graham Hunter	36a3f8f647	[TTI][TLI][AArch64] Support scalable immediates with isLegalAddImmediate (#84173 ) Adds a second parameter (default to 0) to isLegalAddImmediate, to represent a scalable immediate. Extends the AArch64 implementation to match immediates based on what addvl and inc[h|w|d] support.	2024-03-20 10:28:46 +00:00
Graham Hunter	cd768ec983	[AArch64] Support scalable offsets with isLegalAddressingMode (#83255 ) Allows us to indicate that an addressing mode featuring a vscale-relative immediate offset is supported.	2024-03-20 10:13:20 +00:00
Paschalis Mpeis	f795d1a8b1	[AArch64][LV][SLP] Vectorizers use call cost for vectorized frem (#82488 ) getArithmeticInstrCost is used by both LoopVectorizer and SLPVectorizer to compute the cost of frem, which becomes a call cost on AArch64 when TLI has a vector library function. Add tests that do SLP vectorization for code that contains 2x double and 4x float frem instructions.	2024-03-14 17:20:29 +00:00
Kolya Panchenko	889d99a50f	[TTI] Add alignment argument to TTI for compress/expand support (#83516 ) Since `llvm.compressstore` and `llvm.expandload` do require memory access, it's essential for some targets to check whether the alignment is suitable before lowering them to target-specific instructions.	2024-03-05 20:33:56 -05:00
Alexey Bataev	8ad14b6d90	[TTI]Add support for strided loads/stores. Added basic legality check and cost estimation functions for strided loads and stores. These interfaces will be built upon in https://github.com/llvm/llvm-project/pull/80310. Reviewers: preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/80329	2024-02-01 16:07:38 -05:00
David Green	a2d68b4bec	[SelectOpt] Add handling for Select-like operations. (#77284 ) Some operations behave like selects. For example `or(zext(c), y)` is the same as `select(c, y|1, y)` and instcombine can canonicalize the select to the or form. These operations can still be worthwhile converting to branch as opposed to keeping as a select or or instruction. This patch attempts to add some basic handling for them, creating a SelectLike abstraction in the select optimization pass. The backend can opt into handling `or(zext(c),x)` as a select if it could be profitable, and the select optimization pass attempts to handle them in much the same way as a `select(c, x|1, x)`. The Or(x, 1) may need to be added as a new instruction, generated as the or is converted to branches. This helps fix a regression from selects being converted to or's recently.	2024-01-22 23:46:58 +00:00
David Sherwood	c7148467fc	[AArch64] Add an AArch64 pass for loop idiom transformations (#72273 ) We have added a new pass that looks for loops such as the following: ``` while (i != max_len) if (a[i] != b[i]) break; ... use index i ... ``` Although similar to a memcmp, this is slightly different because instead of returning the difference between the values of the first non-matching pair of bytes, it returns the index of the first mismatch. As such, we are not able to lower this to a memcmp call. The new pass can now spot such idioms and transform them into a specialised predicated loop that gives a significant performance improvement for AArch64. It is intended as a stop-gap solution until this can be handled by the vectoriser, which doesn't currently deal with early exits. This specialised loop makes use of a generic intrinsic that counts the trailing zero elements in a predicate vector. This was added in https://reviews.llvm.org/D159283 and for SVE we end up with brkb & incp instructions. Although we have added this pass only for AArch64, it was written in a generic way so that in theory it could be used by other targets. Currently the pass requires scalable vector support and needs to know the minimum page size for the target, however it's possible to make it work for fixed-width vectors too. Also, the llvm.experimental.cttz.elts intrinsic used by the pass has generic lowering, but can be made efficient for targets with instructions similar to SVE's brkb, cntp and incp. Original version of patch was posted on Phabricator: https://reviews.llvm.org/D158291 Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David Sherwood (@david-arm) See the original discussion on Discourse: https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383	2024-01-09 11:29:28 +00:00
Alexey Bataev	5096501082	[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461 ) SLP/TTI do not know about the cost estimation for addsub pattern, supported by X86. Previously the support for pattern detection was added (see TTI::isLegalAltInstr), but the cost was still not estimated properly.	2023-12-28 05:04:04 -08:00
Douglas Yung	fb981e6b4b	Revert "[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461 )" This reverts commit bc8c4bbd7973ab9527a78a20000aecde9bed652d. Change is failing to build on several bots: - https://lab.llvm.org/buildbot/#/builders/127/builds/60184 - https://lab.llvm.org/buildbot/#/builders/123/builds/23709 - https://lab.llvm.org/buildbot/#/builders/216/builds/32302	2023-12-27 23:52:04 -08:00
Alexey Bataev	bc8c4bbd79	[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461 ) SLP/TTI do not know about the cost estimation for addsub pattern, supported by X86. Previously the support for pattern detection was added (see TTI::isLegalAltInstr), but the cost was still not estimated properly.	2023-12-27 15:57:21 -05:00
Paul Walker	930b5b52ff	[ConstantHoisting] Add a TTI hook to prevent hoisting. (#69004 ) Code generation can sometimes simplify expensive operations when an operand is constant. An example of this is divides on AArch64 where they can be rewritten using a cheaper sequence of multiplies and subtracts. Doing this is often better than hoisting expensive constants which are likely to be hoisted by MachineLICM anyway.	2023-12-13 17:20:36 +00:00
Philip Reames	e947f95337	[LSR][TTI][RISCV] Enable terminator folding for RISC-V If looking for a miscompile revert candidate, look here! The transform being enabled prefers comparing to a loop invariant exit value for a secondary IV over using an otherwise dead primary IV. This increases register pressure (by requiring the exit value to be live through the loop), but reduces the number of instructions within the loop by one. On RISC-V which has a large number of scalar registers, this is generally a profitable transform. We lose the ability to use a beqz on what is typically a count down IV, and pay the cost of computing the exit value on the secondary IV in the loop preheader, but save an add or sub in the loop body. For anything except an extremely short running loop, or one with extreme register pressure, this is profitable. On spec2017, we see a 0.42% geomean improvement in dynamic icount, with no individual workload regressing by more than 0.25%. Code size wise, we trade a (possibly compressible) beqz and a (possibly compressible) addi for an uncompressible beq. We also add instructions in the preheader. Net result is a slight regression overall, but neutral or better inside the loop. Previous versions of this transform had numerous corner-case correctness bugs. All of the ones I can spot by inspection have been fixed, and I have run this through all of spec2017, but there may be further issues lurking. Adding uses to an IV is a fraught thing to do given poison semantics, so this transform is somewhat inherently risky. This patch is a reworked version of D134893 by @eop. That patch has been abandoned since May, so I picked it up, reworked it a bit, and am landing it.	2023-11-29 12:04:06 -08:00
Sander de Smalen	00a831421f	[AArch64][SME] Extend Inliner cost-model with custom penalty for calls. (#68416 ) This is a stacked PR following on from #68415. This patch has two purposes: (1) It tries to make inlining more likely when it can avoid a streaming-mode change. (2) It avoids inlining when inlining causes more streaming-mode changes. An example of (1) is: ``` void streaming_compatible_bar(void); void foo(void) __arm_streaming { /* other code */ streaming_compatible_bar(); /* other code */ } void f(void) { foo(); // expensive streaming mode change } -> void f(void) { /* other code */ streaming_compatible_bar(); /* other code */ } ``` where it wouldn't have inlined the function when foo would be a non-streaming function. An example of (2) is: ``` void streaming_bar(void) __arm_streaming; void foo(void) __arm_streaming { streaming_bar(); streaming_bar(); } void f(void) { foo(); // expensive streaming mode change } -> (do not inline into) void f(void) { streaming_bar(); // these are now two expensive streaming mode changes streaming_bar(); } ```	2023-10-31 10:28:40 +00:00
Mingming Liu	aa6ee03709	[NFC][Inliner] Introduce another multiplier for cost benefit analysis and make multipliers overriddable in TargetTransformInfo. - The motivation is to expose tunable knobs to control the aggressiveness of inlines for different backend (e.g., machines with different icache size, and workload with different icache/itlb PMU counters). Tuning inline aggressiveness shows a small (~+0.3%) but stable improvement on workload/hardware that is more frontend bound. - Both multipliers could be overridden from command line. Reviewed By: kazu Differential Revision: https://reviews.llvm.org/D153154	2023-10-02 21:27:07 -07:00
David Green	12025cef3e	[CostModel] Use min/max intrinsics for vecreduce.min/max costs This changes the costmodelling of the vecreduce.min/max nodes to use the costs of the relevant min/max intrinsics instead of expanding them to compare and selects. The getMinMaxReductionCost functions have changed to take an Opcode for the relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are no longer needed. A follow up patch will add some basic fminimum/fmaximum costmodelling. Differential Revision: https://reviews.llvm.org/D153547	2023-07-04 15:02:30 +01:00
Luke Lau	a68dcd09e8	[TTI] Use users of GEP to guess access type in getGEPCost Currently getGEPCost uses the target type of the GEP as a heuristic for the type that will be accessed, to pass onto isLegalAddressingMode. Targets use this to work out if a GEP can then be folded into the load/store instruction that uses the GEP. For example, on RISC-V loads and stores can have an offset added to a base register folded into a single instruction, so the following GEP is free: %p = getelementptr i32, ptr %base, i32 42 ; getInstructionCost = 0 %x = load i32, ptr %p ; getInstructionCost = 1 ------------------------------------------------------------------------ lw t0, a0(42) However vector loads and stores cannot have an offset folded into them, so the following GEP is costed: %p = getelementptr <2 x i32>, ptr %base, i32 42 ; getInstructionCost = 1 %x = load <2 x i32>, ptr %p ; getInstructionCost = 1 ------------------------------------------------------------------------ addi a0, 42 vle32 v8, (a0) The issue arises whenever there is a mismatch between the target type of the GEP and the type that is actually accessed: %p = getelementptr i32, ptr %base, i32 42 ; getInstructionCost = 0 %x = load <2 x i32>, ptr %p ; getInstructionCost = 1 ------------------------------------------------------------------------ addi a0, 42 vle32 v8, (a0) Even though this GEP will result in an add instruction, because TTI thinks it's loading an i32, it will think it can be folded and not charge for it. The target type can become mismatched with the memory access during transformations, noticeably during SLP where a scalar base pointer will be reused to perform a vector load or store. This patch adds an optional AccessType argument to getGEPCost which allows the type of memory accessed by users to be passed in as a hint, so that we can more accurately determine if the GEP can be folded into its users. 
If AccessType is not provided, getGEPCost falls back to the old behaviour of using the PointeeType to guess the memory access type. This can be revisited in a later patch. Also for now, only GEPs with exactly one user use the access type hint. Whilst we could look through all users and use all access types to determine if we can fold the GEP, this patch avoids doing so to prevent O(N) behaviour. Differential Revision: https://reviews.llvm.org/D149889	2023-06-29 13:44:37 +01:00
Juan Manuel MARTINEZ CAAMAÑO	cc8a346e3f	[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (1/2) On AMDGPU, alloca instructions have a penalty that can be avoided when SROA is applied after inlining. This patch introduces the default implementation of TargetTransformInfo::getCallerAllocaCost. Reviewed By: mtrofin Differential Revision: https://reviews.llvm.org/D149740	2023-06-29 09:49:16 +02:00
Matt Arsenault	12c12c5fe0	TTI: Add function to hasBranchDivergence It may be possible to contextually ignore divergence in a function if it's known to run single threaded.	2023-06-16 18:47:40 -04:00
Matt Arsenault	a09f79d227	TargetTransformInfo: Add addrspacesMayAlias For some reason we used to only handle address space aliasing through chaining a target specific AA pass. We need never-fail simple queries in order to lower memmove intrinsics based purely on the address spaces. I also think it would be better if BasicAA checked this, rather than relying on the target AA passes. Currently we go through the more expensive AA analyses before getting to the trivial address space checks.	2023-06-13 20:44:00 -04:00
Matt Arsenault	3c848194f2	CodeGen: Expand memory intrinsics in PreISelIntrinsicLowering Expand large or unknown size memory intrinsics into loops in the default lowering pipeline if the target doesn't have the corresponding libfunc. Previously AMDGPU had a custom pass which existed to call the expansion utilities. With a default no-libcall option, we can remove the libfunc checks in LoopIdiomRecognize for these, which never made any sense. This also provides a path to lifting the immarg restriction on llvm.memcpy.inline. There seems to be a bug where TLI reports functions as available if you use -march and not -mtriple.	2023-06-09 21:04:37 -04:00
Luke Lau	c27a0b21c5	[SLP][RISCV] Account for offset folding in getPointersChainCost For a GEP in a pointer chain, if: 1) a pointer chain is unit-strided 2) the base pointer wasn't folded and is sitting in a register somewhere 3) the distance between the GEP and the base pointer is small enough and can be folded into the addressing mode of the using load/store Then we can exclude that GEP from the total cost of the pointer chain, as it will likely be folded away. In order to check if 3) holds, we need to know the type of memory access being made by the users of the pointer chain. For that, we need to pass along a new argument to getPointersChainCost. (Using the source pointer type of the GEP isn't accurate, see https://reviews.llvm.org/D149889 for more details). Also note that 2) is currently an assumption, and could be modelled more accurately. This prevents some unprofitable cases from being SLP vectorized on RISC-V by making the scalar costs cheaper and closer to the actual codegen. For now the getPointersChainCost hook is duplicated for RISC-V to prevent disturbing other targets, but could be merged back in and shared with other targets in a following patch. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D149654	2023-05-22 13:55:30 +01:00
Yonghong Song	da816c2985	[TTI][BPF] Ensure ArgumentPromotion Not Exceeding Target MaxArgs With LLVM patch https://reviews.llvm.org/D148269, we hit a linux kernel bpf selftest compilation failure like below: ... progs/test_xdp_noinline.c:739:8: error: too many args to t8: i64 = GlobalAddress<ptr @encap_v4> 0, progs/test_xdp_noinline.c:739:8 if (!encap_v4(xdp, cval, &pckt, dst, pkt_bytes)) ^ ... progs/test_xdp_noinline.c:321:6: error: defined with too many args bool encap_v4(struct xdp_md xdp, struct ctl_value cval, ^ ... Note that bpf selftests are compiled with -O2, which is the recommended flag for the bpf community. The bpf backend calling convention allows only 5 parameters in registers and does not allow passing arguments on the stack. In the above case, ArgumentPromotionPass replaced parameter '&pckt' with two parameters, so the total number of arguments after the ArgumentPromotion pass becomes 6, and this later caused a compilation failure during the instruction selection phase. This patch added a TargetTransformInfo hook getMaxNumArgs() which returns 5 for BPF and UINT_MAX for other targets. Differential Revision: https://reviews.llvm.org/D148551	2023-04-19 09:09:20 -07:00
pvanhout	ae77aceba5	[Analysis] Remove DA & LegacyDA UniformityAnalysis offers all of the same features and much more, there is no reason left to use the legacy DAs. See RFC: https://discourse.llvm.org/t/rfc-deprecate-divergenceanalysis-legacydivergenceanalysis/69538 - Remove LegacyDivergenceAnalysis.h/.cpp - Remove DivergenceAnalysis.h/.cpp + Unit tests - Remove SyncDependenceAnalysis - it was not a real registered analysis and was only used by DAs - Remove/adjust references to the passes in the docs where applicable - Remove TTI hook associated with those passes. - Move tests to UniformityAnalysis folder. - Remove RUN lines for the DA, leave only the UA ones. - Some tests had to be adjusted/removed depending on how they used the legacy DAs. Reviewed By: foad, sameerds Differential Revision: https://reviews.llvm.org/D148116	2023-04-17 09:01:22 +02:00
Simon Pilgrim	fb8038db73	[TTI] getExtendedReductionCost - replace std::optional<FastMathFlags> args with FastMathFlags Followup to D148149 where it was noticed that the std::optional wrapper wasn't helping with anything (we can just use an empty FastMathFlags()).	2023-04-13 11:26:28 +01:00
Simon Pilgrim	9e30b87afb	[TTI] getMinMaxReductionCost - add FastMathFlag argument Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>). This will be necessary to correctly recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6). Differential Revision: https://reviews.llvm.org/D148149	2023-04-13 10:42:42 +01:00
Michael Liao	72fc08a541	[InstCombine] Teach alloca replacement to handle `addrspacecast` - As the address space cast may not be valid on a specific target, `addrspacecast` is not handled when an `alloca` is able to be replaced with the source of memcpy/memmove. This patch addresses that by querying a target hook on whether that address space cast is valid. For example, on most GPU targets, the cast from a global pointer to a generic pointer is valid. - If that cast is allowed (by querying `isValidAddrSpaceCast`), the replacement is enhanced to handle that `addrspacecast` as well. Reviewed By: yaxunl Differential Revision: https://reviews.llvm.org/D147025	2023-04-11 11:47:37 -04:00
David Sherwood	b4089cfa2f	[NFC][LoopVectorize] Simplify preferPredicateOverEpilogue interface Given just how many arguments we pass to preferPredicateOverEpilogue and considering this list may grow over time I've decided to pass in a pointer to a new TailFoldingInfo structure instead, similar to what we do with IntrinsicCostAttributes, etc. In addition, many of the arguments we pass in are actually available in the LoopVectorizationLegality class so I've managed to reduce the set of pointers that we need to pass in the TailFoldingInfo struct. Differential Revision: https://reviews.llvm.org/D146127	2023-04-04 14:00:49 +00:00
Wang, Xin10	0bf75d8541	Handle the unexpected inputs for pass HardwareLoops For the function TryConvertLoop in pass HardwareLoops, wrong input arguments will lead to a crash. There are 3 cases. In line 342, the compiler wants to get something from HWLoopInfo.CountType, which depends on whether the argument Bitwidth is given; if not, it will crash. In function isHardwareLoopCandidate, it dereferences CountType too. In function InsertLoopDec, it dereferences LoopDecrement. All of these could lead to a crash. This patch adds a condition to this pass so that when we meet unexpected inputs we skip the pass. Reviewed By: samparker, fhahn Differential Revision: https://reviews.llvm.org/D146277	2023-03-30 10:42:19 +08:00
David Sherwood	1c4fedfa35	[LoopVectorize] Don't tail-fold for scalable VFs when there is no scalar tail Currently in LoopVectorize we avoid tail-folding if we can prove the trip count is always a multiple of the maximum fixed-width VF. This works because we know the vectoriser only ever chooses a VF that is a power of 2. However, if we are also considering scalable VFs then we conservatively bail out of the optimisation because we don't know the value of vscale, which could be an odd or prime number, etc. This patch tries to enable the same optimisation for scalable VFs by asking if vscale is known to be a power of 2. If so, we can then query the maximum value of vscale and use the same logic as we do for fixed-width VFs. I've also added a new TTI hook called isVScaleKnownToBeAPowerOfTwo that does the same thing as the existing TargetLowering hook. Differential Revision: https://reviews.llvm.org/D146199	2023-03-27 08:34:30 +00:00
Valery N Dmitriev	f9b438b519	[SLP] Outline GEP chain cost modeling into new TTI interface - NFCI. Cost modeling for GEPs should actually be target dependent but is currently done inside SLP in a target-independent way. Sinking it into TTI enables a target-dependent implementation. This patch adds a new TTI interface and an implementation of the basic functionality, trying to retain the existing cost modeling. Differential Revision: https://reviews.llvm.org/D144770	2023-03-14 14:01:34 -07:00
Sander de Smalen	c41b41eb11	[LoopVectorize] Use overflow-check analysis to improve tail-folding. This work follows on from D142109 and addresses a possible regression when we know the loop iteration counter cannot overflow. When we know the overflow-check always evaluates to false, it's better to use the other style of tail folding where it assumes a runtime check was added, because that avoids having to calculate a modified trip-count. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D142894	2023-03-01 14:17:58 +00:00
Luke Lau	b02b1e0ed6	[LV][NFC] Use ElementCount for getMaxInterleaveFactor In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor. This is based off of the approach used here: `8d36708507` The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch. See https://reviews.llvm.org/D143723#4132349 Reviewed By: reames Differential Revision: https://reviews.llvm.org/D144474	2023-02-22 10:15:05 +00:00
Simon Tatham	a8cd35c3b7	[LowerTypeTests] Support generating Armv6-M jump tables. (reland) [Originally committed as f6ddf7781471b71243fa3c3ae7c93073f95c7dff; reverted in bbef38352fbade9e014ec97d5991da5dee306da7 due to test breakage; now relanded with the Arm tests conditioned on `arm-registered-target`] The LowerTypeTests pass emits a jump table in the form of an `inlineasm` IR node containing a string representation of some assembly. It tests the target triple to see what architecture it should be generating assembly for. But that's not good enough for `Triple::thumb`, because the 32-bit PC-relative `b.w` branch instruction isn't available in all supported architecture versions. In particular, Armv6-M doesn't support that instruction (although the similar Armv8-M Baseline does). Most of this patch is concerned with working out whether the compilation target is Armv6-M or not, which I'm doing by going through all the functions in the module, retrieving a TargetTransformInfo for each one, and querying it via a new method I've added to check its SubtargetInfo. If any function's TTI indicates that it's targeting an architecture supporting B.W, then we assume we're also allowed to use B.W in the jump table. The Armv6-M compatible jump table format requires a temporary register, and therefore also has to use the stack in order to restore that register. Another consequence of this change is that jump tables on Arm/Thumb are no longer always the same size. In particular, on an architecture that supports Arm and Thumb-1 but not Thumb-2, the Arm and Thumb tables are different sizes from //each other//. As a consequence, ``getJumpTableEntrySize`` can no longer base its answer on the target triple's architecture: it has to take into account the decision that ``selectJumpTableArmEncoding`` made, which meant I had to move that function to an earlier point in the code and store its answer in the ``LowerTypeTestsModule`` class. 
Reviewed By: lenary Differential Revision: https://reviews.llvm.org/D143576	2023-02-20 10:46:47 +00:00
Simon Tatham	bbef38352f	Revert "[LowerTypeTests] Support generating Armv6-M jump tables." This reverts commit f6ddf7781471b71243fa3c3ae7c93073f95c7dff. Eight buildbots reported that the two test files changed by that commit had started failing. The buildbots in question all had in common that they build with a very restricted `LLVM_TARGETS_TO_BUILD`, such as only X86 or AArch64 or Hexagon. I didn't notice this before commit because my own build has the full default set of targets, and in that circumstance, the tests pass. I assume the problem has something to do with the attempt to query TargetTransformInfo: if you can't make a valid TTI for the target triple then you can't ask it what kind of inline assembler you should be emitting, and so `opt` without the Arm backend can't get the Arm cases of these tests right. I don't have time to fix this until next week, so I'll revert the change for now to keep the buildbots happy.	2023-02-16 17:11:06 +00:00
Simon Tatham	f6ddf77814	[LowerTypeTests] Support generating Armv6-M jump tables. The LowerTypeTests pass emits a jump table in the form of an `inlineasm` IR node containing a string representation of some assembly. It tests the target triple to see what architecture it should be generating assembly for. But that's not good enough for `Triple::thumb`, because the 32-bit PC-relative `b.w` branch instruction isn't available in all supported architecture versions. In particular, Armv6-M doesn't support that instruction (although the similar Armv8-M Baseline does). Most of this patch is concerned with working out whether the compilation target is Armv6-M or not, which I'm doing by going through all the functions in the module, retrieving a TargetTransformInfo for each one, and querying it via a new method I've added to check its SubtargetInfo. If any function's TTI indicates that it's targeting an architecture supporting B.W, then we assume we're also allowed to use B.W in the jump table. The Armv6-M compatible jump table format requires a temporary register, and therefore also has to use the stack in order to restore that register. Another consequence of this change is that jump tables on Arm/Thumb are no longer always the same size. In particular, on an architecture that supports Arm and Thumb-1 but not Thumb-2, the Arm and Thumb tables are different sizes from //each other//. As a consequence, ``getJumpTableEntrySize`` can no longer base its answer on the target triple's architecture: it has to take into account the decision that ``selectJumpTableArmEncoding`` made, which meant I had to move that function to an earlier point in the code and store its answer in the ``LowerTypeTestsModule`` class. Reviewed By: lenary Differential Revision: https://reviews.llvm.org/D143576	2023-02-16 15:34:49 +00:00
Sander de Smalen	005311399e	[LoopVectorize][TTI] NFCI: Clarify enum for the tail folding style. This NFC (intended) patch has several small changes: * It renames PredicationStyle to TailFoldingStyle. * It renames TTI.emitActiveLaneMask() to TTI.getPreferredTailFoldingStyle() * Simplifies some of its uses in the LoopVectorizer Rationale: To my surprise PredicationStyle::None did not mean 'no predication', but rather 'no active lane mask intrinsic', such that the predicate is created using a splat + compare with stepvector. The enum is also highly specific to tail folding, so it seems better to name this around that feature, i.e. 'tail folding style'. This also makes it more amenable to extend it to other tail folding styles, such as the one added in D142109. Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D142887	2023-02-03 14:59:57 +00:00
Serguei Katkov	6188929dfb	[TTI][NFC] Introduce option to set predictable branch threshold Currently the TargetTransformInfo::getPredictableBranchThreshold() method returns the hardcoded value 99. This value affects the decision whether to convert a select instruction to a branch in several passes: SelectOptimize, CodeGenPrepare, SimplifyCFG. It would be useful to be able to adjust that threshold in order to test select-optimize heuristics. The option was originally introduced in TargetLoweringBase, but was removed in revision 664d0c052c315 and not restored in TTI. Patch Author: aleksandr.popov Reviewed By: spatel Differential Revision: https://reviews.llvm.org/D143060	2023-02-02 17:54:56 +07:00
ShihPo Hung	5fb3a57ea7	[Cost] Add CostKind to getVectorInstrCost and its related users LoopUnroll estimates the loop size via getInstructionCost(), but getInstructionCost() cannot pass CostKind to getVectorInstrCost(). The same is true of getShuffleCost() with respect to getBroadcastShuffleOverhead(), getPermuteShuffleOverhead(), getExtractSubvectorOverhead(), and getInsertSubvectorOverhead(). To address this, this patch adds a CostKind argument to these functions. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D142116	2023-01-21 05:29:24 -08:00
Alexey Bataev	9b5f62685a	[SLP]Fix cost of the broadcast buildvector/gather. Need to include the cost of the initial insertelement to the cost of the broadcasts. Also, need to adjust the cost of the gather/buildvector if the element is inserted into poison/undef vector. Differential Revision: https://reviews.llvm.org/D140498	2023-01-06 09:25:05 -08:00
David Green	16a72a0f87	[AArch64] Enable the select optimize pass for AArch64 This enables the select optimize pass for Arm out-of-order AArch64 cores. It is trying to solve a problem that is difficult for the compiler to fix. The criteria for when a csel is better or worse than a branch depends heavily on whether the branch is well predicted and the amount of ILP in the loop (as well as other criteria like the core in question and the relative performance of the branch predictor). The pass seems to do a decent job though, with the inner loop heuristics being well implemented and doing a better job than I had expected in general, even without PGO information. I've been doing quite a bit of benchmarking. The headline numbers are these for SPEC2017 on a Neoverse N1: 500.perlbench_r -0.12% 502.gcc_r 0.02% 505.mcf_r 6.02% 520.omnetpp_r 0.32% 523.xalancbmk_r 0.20% 525.x264_r 0.02% 531.deepsjeng_r 0.00% 541.leela_r -0.09% 548.exchange2_r 0.00% 557.xz_r -0.20% Running benchmarks with a combination of the llvm-test-suite plus several versions of SPEC gave between a 0.2% and 0.4% geomean improvement depending on the core/run. The instruction count went down by 0.1% too, which is a good sign, but the results can be a little noisy. Some issues from other benchmarks I had run were improved in rGca78b5601466f8515f5f958ef8e63d787d9d812e. In summary well predicted branches will see an improvement, badly predicted branches may get worse, and on average performance seems to be a little better overall. This patch enables the pass for AArch64 under -O3 for cores that will benefit from it, i.e. not in-order cores that do not fit into the "Assume infinite resources that allow to fully exploit the available instruction-level parallelism" cost model. It uses a subtarget feature for specifying when the pass will be enabled, which I have enabled under cpu=generic as the performance increases for out of order cores seem larger than any decreases for inorder, which were minor. 
Differential Revision: https://reviews.llvm.org/D138990	2022-12-03 16:08:58 +00:00
Krzysztof Parzyszek	86fe4dfdb6	TargetTransformInfo: convert Optional to std::optional Recommit: added missing "#include <cstdint>".	2022-12-02 11:42:15 -08:00
Krzysztof Parzyszek	4e12d1836a	Revert "TargetTransformInfo: convert Optional to std::optional" This reverts commit b83711248cb12639e7ef7303cfbb4452b4067e85. Some buildbots are failing.	2022-12-02 11:34:04 -08:00
Krzysztof Parzyszek	b83711248c	TargetTransformInfo: convert Optional to std::optional	2022-12-02 11:27:12 -08:00
Krzysztof Parzyszek	26424c96c0	Attributes: convert Optional to std::optional	2022-12-02 08:15:45 -06:00
Stanislav Mekhanoshin	bcaf31ec3f	[AMDGPU] Allow finer grain control of an unaligned access speed A target can return whether or not a misaligned access is 'fast' as defined by the target. In reality there can be different levels of 'fast' and 'slow'. This patch changes the boolean 'Fast' argument of the allowsMisalignedMemoryAccesses family of functions to an unsigned representing its speed. A target can still define it as it wants, and the direct translation of the current code uses 0 and 1 for the current false and true. This makes the change an NFC. A subsequent patch will start using an actual speed value in the load/store vectorizer to check whether a vectorized access is going to be not just fast, but no slower than before. Differential Revision: https://reviews.llvm.org/D124217	2022-11-17 09:23:53 -08:00
Philip Reames	269bc684e7	[LV][RISCV] Disable vectorization of epilogue loops Epilogue loop vectorization is a feature in the vectorizer intended to avoid running fully scalar code when the vector length of the main loop turns out to be either longer than the trip count of the actual loop, or with a huge remainder. In practice, this feature appears to not have been well tuned. I honestly don't think it should be on by default at all, but it definitely shouldn't be on for RISCV. Note that other targets have also disabled it, but they've done so via disabling interleaving - which is, well, completely unrelated - and we don't want to do that for RISCV. In the near term, many examples I'm seeing have terrible codegen for epilogue vectorization. We are greatly increasing code size for little value at reasonable VLEN values for small types. In the long term, the cases that epilogue vectorization is intended to handle are likely better handled via tail folding on RISCV. As an aside, I also don't really trust the correctness of epilogue vectorization. The code structure is such that otherwise straightforward changes sometimes break only epilogue vectorization. The reuse of an existing vplan without careful validation opens significant room for nasty bugs. Given how rarely the code is exercised, that is not a good combination. As such, this patch introduces a TTI hook, and completely disables epilogue vectorization on RISCV. Differential Revision: https://reviews.llvm.org/D136695	2022-10-25 14:28:02 -07:00
Shubham Narlawar	b920407cf5	[LICM] Disable thread-safety checks in single-thread model If the single-thread model is used, or the -licm-force-thread-model-single flag is specified, skip checks related to thread-safety. This means that store promotion for conditionally executed stores only requires proof of dereferenceability and writability, but not of thread-safety. For example, this enables promotion of stores to (non-constant) globals, as well as captured allocas. Fixes https://github.com/llvm/llvm-project/issues/50537. Differential Revision: https://reviews.llvm.org/D130466	2022-10-10 16:51:16 +02:00
Simon Pilgrim	c94cbc343e	Fix gcc warning about ambiguous if-else chain Fixes warnings introduced by D111968	2022-09-23 14:36:28 +01:00
Simon Pilgrim	a6e9141505	[TTI] Add OperandValueProperties::OP_NegatedPowerOf2 enum (PR51436) The mul by constant costmodels handle power-of-2 constants, but not negated-power-of-2, despite the backends handling both. This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum and wires it for use for basic mul cost analysis and SLP handling. Fixes #50778 Differential Revision: https://reviews.llvm.org/D111968	2022-09-23 14:03:18 +01:00

470 Commits