llvm-project

Author	SHA1	Message	Date
Sjoerd Meijer	9bccf61f5f	[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385 ) Set the maximum interleaving factor to 4, aligning with the number of available SIMD pipelines. This increases the number of vector instructions in the vectorised loop body, enhancing performance during its execution. However, for very low iteration counts, the vectorised body might not execute at all, leaving only the epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which experienced a performance regression. To address this, the patch reduces the minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be vectorised and largely mitigating the regression.	2024-11-20 09:33:39 +00:00
Sushant Gokhale	9991ea28fc	[CostModel][AArch64] Make extractelement, with fmul user, free whenev… (#111479 ) …er possible In case of Neon, if there exists extractelement from lane != 0 such that 1. extractelement does not necessitate a move from vector_reg -> GPR 2. extractelement result feeds into fmul 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0 then the extractelement can be merged with fmul in the backend and it incurs no cost. e.g. ``` define double @foo(<2 x double> %a) { %1 = extractelement <2 x double> %a, i32 0 %2 = extractelement <2 x double> %a, i32 1 %res = fmul double %1, %2 ret double %res } ``` `%2` and `%res` can be merged in the backend to generate: `fmul d0, d0, v0.d[1]` The change was tested with SPEC FP(C/C++) on Neoverse-v2. Compile time impact: None Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.	2024-11-13 11:10:49 +05:30
Kazu Hirata	236fda550d	[Analysis] Remove unused includes (NFC) (#114936 ) Identified with misc-include-cleaner.	2024-11-05 19:11:34 -08:00
Nashe Mncube	e37d736def	Recommit: [llvm][ARM][GlobalOpt]Add widen global arrays pass (#113289 ) This is a recommit of #107120 . The original PR was approved but failed buildbot. The newly added tests should only be run for compilers that support the ARM target. This has been resolved by adding a config file for these tests. - Pass optimizes memcpy's by padding out destinations and sources to a full word to make ARM backend generate full word loads instead of loading a single byte (ldrb) and/or half word (ldrh). Only pads destination when it's a stack allocated constant size array and source when it's constant string. Heuristic to decide whether to pad or not is very basic and could be improved to allow more examples to be padded. - Pass works at the midend level	2024-10-24 10:12:01 +01:00
Nashe Mncube	370fd74361	Revert "[llvm][ARM]Add widen global arrays pass" (#112701 ) Reverts llvm/llvm-project#107120 Unexpected build failures in post-commit pipelines. Needs investigation	2024-10-17 13:38:01 +01:00
Nashe Mncube	ab90d2793c	[llvm][ARM]Add widen global arrays pass (#107120 ) - Pass optimizes memcpy's by padding out destinations and sources to a full word to make backend generate full word loads instead of loading a single byte (ldrb) and/or half word (ldrh). Only pads destination when it's a stack allocated constant size array and source when it's constant array. Heuristic to decide whether to pad or not is very basic and could be improved to allow more examples to be padded. - Pass works within GlobalOpt but is disabled by default on all targets except ARM.	2024-10-17 11:56:00 +01:00
Alexey Bataev	f9bc00e4bb	[SLP]Initial support for interleaved loads Adds initial support for interleaved loads, which allows emission of segmented loads for RISCV RVV. Vectorizes extra code for RISCV CFP2006/447.dealII, CFP2006/453.povray, CFP2017rate/510.parest_r, CFP2017rate/511.povray_r, CFP2017rate/526.blender_r, CFP2017rate/538.imagick_r, CINT2006/403.gcc, CINT2006/473.astar, CINT2017rate/502.gcc_r, CINT2017rate/525.x264_r Reviewers: RKSimon, preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/112042	2024-10-14 09:12:33 -04:00
Tim Renouf	76007138f4	[LLVM] New NoDivergenceSource function attribute (#111832 ) A call to a function that has this attribute is not a source of divergence, as used by UniformityAnalysis. That allows a front-end to use known-name calls as an instruction extension mechanism (e.g. https://github.com/GPUOpen-Drivers/llvm-dialects ) without such a call being a source of divergence.	2024-10-12 09:34:45 +01:00
Shilei Tian	e34e27f198	[TTI][AMDGPU] Allow targets to adjust `LastCallToStaticBonus` via `getInliningLastCallToStaticBonus` (#111311 ) Currently we will not be able to inline a large function even if it only has one live use because the inline cost is still very high after applying `LastCallToStaticBonus`, which is a constant. This could significantly impact the performance because CSR spill is very expensive. This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to allow targets to customize this value. Fixes SWDEV-471398.	2024-10-11 10:19:54 -04:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Farzon Lotfi	63a0a81e73	[NFC][Scalarizer][TargetTransformInfo] Add isTargetIntrinsicWithScalarOpAtArg api (#111441 ) This change allows target intrinsics can have scalar args fixes [111440](https://github.com/llvm/llvm-project/issues/111440) This change will let us add scalarization for WaveReadLaneAt: https://github.com/llvm/llvm-project/pull/111010	2024-10-07 19:57:07 -04:00
Philip Reames	d288574363	[TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824 ) This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704, and extends the costing API for compares and selects to provide information about the operands passed in an analogous manner. This allows us to model the cost of materializing the vector constant, as some select-of-constants are significantly more expensive than others when you account for the cost of materializing the constants involved. This is a stepping stone towards fixing https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch will be required to utilize the new API.	2024-09-25 07:25:57 -07:00
Farzon Lotfi	0f97b4824a	[Scalarizer][DirectX] Add support for scalarization of Target intrinsics (#108776 ) Since we are using the Scalarizer pass in the backend we needed a way to allow this pass to operate on Target intrinsics. We achieved this by adding `TargetTransformInfo ` to the Scalarizer pass. This allowed us to call a function available to the DirectX backend to know if an intrinsic is a target intrinsic that should be scalarized.	2024-09-17 11:35:42 -04:00
Philip Reames	27a62ec72a	[LSR] Split the -lsr-term-fold transformation into it's own pass (#104234 ) This transformation doesn't actually use any of the internal state of LSR and recomputes all information from SCEV. Splitting it out makes it easier to test. Note that long term I would like to write a version of this transform which is integrated with LSR's solver, but if that happens, we'll just delete the extra pass. Integration wise, I switched from using TTI to using a pass configuration variable. This seems slightly more idiomatic, and means we don't run the extra logic on any target other than RISCV.	2024-08-17 18:34:23 -07:00
Jeremy Morse	bde243259b	Revert "[Asan] Provide TTI hook to provide memory reference infromation of target intrinsics. (#97070 )" This reverts commit e8ad87c7d06afe8f5dde2e4c7f13c314cb3a99e9. This reverts commit d3c9bb0cf811424dcb8c848cf06773dbdde19965. A few buildbots trip up on asan-rvv-intrinsics.ll. I've also reverted the follow-up commit d3c9bb0cf8. https://lab.llvm.org/buildbot/#/builders/46/builds/2895	2024-08-08 12:26:05 +01:00
Yeting Kuo	e8ad87c7d0	[Asan] Provide TTI hook to provide memory reference infromation of target intrinsics. (#97070 ) Previously asan considers target intrinsics as black boxes, so asan could not instrument accurate check. This patch provide TTI hooks to make targets describe their intrinsic informations to asan. Note, 1. this patch renames InterestingMemoryOperand to MemoryRefInfo. 2. this patch does not support RVV indexed/segment load/store.	2024-08-08 13:40:26 +08:00
Fabian Ritter	9e462b7ea2	[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984 ) ...and also in TTI::getMemcpyLoopResidualLoweringType.	2024-07-29 13:40:53 +02:00
Tianqing Wang	3d494bfc7f	[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. (#98495 ) The `!unpredictable` metadata has been present for a long time, but it's usage in optimizations is still limited. This patch teaches `FoldTwoEntryPHINode()` to be more aggressive with an unpredictable branch to reduce mispredictions. A TTI interface `getBranchMispredictPenalty()` is added to distinguish between different hardwares to ensure we don't go too far for simpler cores. For simplicity, only a naive x86 implementation is included for the time being.	2024-07-23 07:47:21 +08:00
Sjoerd Meijer	c5329c827a	[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (#95819 ) For the Neoverse V2 we would like to prefer fixed width over scalable vectorisation if the cost-model assigns an equal cost to both for certain loops. This improves 7 kernels from TSVC-2 and several production kernels by about 2x, and does not affect SPEC21017 INT and FP. This also adds a new TTI hook that can steer the loop vectorizater to preferring fixed width vectorization, which can be set per CPU. For now, this is only enabled for the Neoverse V2. There are 3 reasons why preferring NEON might be better in the case the cost-model is a tie and the SVE vector size is the same as NEON (128-bit): architectural reasons, micro-architecture reasons, and SVE codegen reasons. The latter will be improved over time, so the more important reasons are the former two. I.e., (micro) architecture reason is the use of LPD/STP instructions which are not available in SVE2 and it avoids predication. For what it is worth: this codegen strategy to generate more NEON is inline with GCC's codegen strategy, which is actually even more aggressive in generating NEON when no predication is required. We could be smarter about the decision making, but this seems to be a first good step in the right direction, and we can always revise this later (for example make the target hook more general).	2024-07-17 10:46:28 +01:00
Sam Parker	d28ed29d6b	[TTI][WebAssembly] Pairwise reduction expansion (#93948 ) WebAssembly doesn't support horizontal operations nor does it have a way of expressing fast-math or reassoc flags, so runtimes are currently unable to use pairwise operations when generating code from the existing shuffle patterns. This patch allows the backend to select which, arbitary, shuffle pattern to be used per reduction intrinsic. The default behaviour is the same as the existing, which is by splitting the vector into a top and bottom half. The other pattern introduced is for a pairwise shuffle. WebAssembly enables pairwise reductions for int/fp add/sub.	2024-07-17 09:21:52 +01:00
Nikita Popov	9df71d7673	[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919 ) Similar to https://github.com/llvm/llvm-project/pull/96902, this adds `getDataLayout()` helpers to Function and GlobalValue, replacing the current `getParent()->getDataLayout()` pattern.	2024-06-28 08:36:49 +02:00
Shengchen Kan	15fc801cf0	[X86][CodeGen] Support hoisting load/store with conditional faulting (#96720 ) 1. Add TTI interface for conditional load/store. 2. Mark 1 x i16/i32/i64 masked load/store legal so that it's not legalized in pass scalarize-masked-mem-intrin. 3. Visit 1 x i16/i32/i64 masked load/store to build a target-specific CLOAD/CSTORE node to avoid error in `DAGTypeLegalizer::ScalarizeVectorResult`. 4. Combine DAG to simplify the nodes for CLOAD/CSTORE. 5. Lower CLOAD/CSTORE to CFCMOV by pattern match. This is CodeGen part of #95515	2024-06-27 17:01:55 +08:00
Kazu Hirata	1462605ab0	[Analysis] Use range-based for loops (NFC) (#96587 )	2024-06-25 06:57:30 -07:00
Alex Bradbury	5a20141539	[LSR] Provide TTI hook to enable dropping solutions deemed to be unprofitable (#89924 ) <https://reviews.llvm.org/D126043> introduced a flag to drop solutions if deemed unprofitable. As noted there, introducing a TTI hook enables backends to individually opt into this behaviour. This will be used by #89927.	2024-06-05 14:40:58 +02:00
Tyler Lanphear	e8dd4df72b	[NFC][TTI] Mark `getReplicationShuffleCost()` as `const` (#92194 )	2024-05-22 09:44:10 -07:00
Graham Hunter	fbb37e9606	[AArch64] Add an all-in-one histogram intrinsic Based on discussion from https://discourse.llvm.org/t/rfc-vectorization-support-for-histogram-count-operations/74788 Current interface is: llvm.experimental.histogram(<vecty> ptrs, <intty> inc_amount, <vecty> mask) The integer type used by 'inc_amount' needs to match the type of the buckets in memory. The intrinsic covers the following operations: * Gather load * histogram on the elements of 'ptrs' * multiply the histogram results by 'inc_amount' * add the result of the multiply to the values loaded by the gather * scatter store the results of the add Supports lowering to histcnt instructions for AArch64 targets, and scalarization for all others at present.	2024-05-13 11:35:28 +01:00
Graham Hunter	2e8d815596	[TTI] Support scalable offsets in getScalingFactorCost (#88113 ) Part of the work to support vscale-relative immediates in LSR.	2024-05-10 11:22:11 +01:00
David Green	4ac2721e51	[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934 ) This tries to add some costs for the shuffle in a ST3/ST4 instruction, which are represented in LLVM IR as store(interleaving shuffle). In order to detect the store, it needs to add a CxtI context instruction to check the users of the shuffle. LD3 and LD4 are added, LD2 should be a zip1 shuffle, which will be added in another patch. It should help fix some of the regressions from #87510.	2024-04-09 16:36:08 +01:00
Graham Hunter	36a3f8f647	[TTI][TLI][AArch64] Support scalable immediates with isLegalAddImmediate (#84173 ) Adds a second parameter (default to 0) to isLegalAddImmediate, to represent a scalable immediate. Extends the AArch64 implementation to match immediates based on what addvl and inc[h\|w\|d] support.	2024-03-20 10:28:46 +00:00
Graham Hunter	cd768ec983	[AArch64] Support scalable offsets with isLegalAddressingMode (#83255 ) Allows us to indicate that an addressing mode featuring a vscale-relative immediate offset is supported.	2024-03-20 10:13:20 +00:00
Paschalis Mpeis	f795d1a8b1	[AArch64][LV][SLP] Vectorizers use call cost for vectorized frem (#82488 ) getArithmeticInstrCost is used by both LoopVectorizer and SLPVectorizer to compute the cost of frem, which becomes a call cost on AArch64 when TLI has a vector library function. Add tests that do SLP vectorization for code that contains 2x double and 4x float frem instructions.	2024-03-14 17:20:29 +00:00
Kolya Panchenko	889d99a50f	[TTI] Add alignment argument to TTI for compress/expand support (#83516 ) Since `llvm.compressstore` and `llvm.expandload` do require memory access, it's essential for some target to check if alignment is good to be able to lower them to target-specific instructions	2024-03-05 20:33:56 -05:00
Alexey Bataev	8ad14b6d90	[TTI]Add support for strided loads/stores. Added basic legality check and cost estimation functions for strided loads and stores. These interfaces will be built upon in https://github.com/llvm/llvm-project/pull/80310. Reviewers: preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/80329	2024-02-01 16:07:38 -05:00
David Green	a2d68b4bec	[SelectOpt] Add handling for Select-like operations. (#77284 ) Some operations behave like selects. For example `or(zext(c), y)` is the same as select(c, y\|1, y)` and instcombine can canonicalize the select to the or form. These operations can still be worthwhile converting to branch as opposed to keeping as a select or or instruction. This patch attempts to add some basic handling for them, creating a SelectLike abstraction in the select optimization pass. The backend can opt into handling `or(zext(c),x)` as a select if it could be profitable, and the select optimization pass attempts to handle them in much the same way as a `select(c, x\|1, x)`. The Or(x, 1) may need to be added as a new instruction, generated as the or is converted to branches. This helps fix a regression from selects being converted to or's recently.	2024-01-22 23:46:58 +00:00
David Sherwood	c7148467fc	[AArch64] Add an AArch64 pass for loop idiom transformations (#72273 ) We have added a new pass that looks for loops such as the following: ``` while (i != max_len) if (a[i] != b[i]) break; ... use index i ... ``` Although similar to a memcmp, this is slightly different because instead of returning the difference between the values of the first non-matching pair of bytes, it returns the index of the first mismatch. As such, we are not able to lower this to a memcmp call. The new pass can now spot such idioms and transform them into a specialised predicated loop that gives a significant performance improvement for AArch64. It is intended as a stop-gap solution until this can be handled by the vectoriser, which doesn't currently deal with early exits. This specialised loop makes use of a generic intrinsic that counts the trailing zero elements in a predicate vector. This was added in https://reviews.llvm.org/D159283 and for SVE we end up with brkb & incp instructions. Although we have added this pass only for AArch64, it was written in a generic way so that in theory it could be used by other targets. Currently the pass requires scalable vector support and needs to know the minimum page size for the target, however it's possible to make it work for fixed-width vectors too. Also, the llvm.experimental.cttz.elts intrinsic used by the pass has generic lowering, but can be made efficient for targets with instructions similar to SVE's brkb, cntp and incp. Original version of patch was posted on Phabricator: https://reviews.llvm.org/D158291 Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David Sherwood (@david-arm) See the original discussion on Discourse: https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383	2024-01-09 11:29:28 +00:00
Alexey Bataev	5096501082	[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461 ) SLP/TTI do not know about the cost estimation for addsub pattern, supported by X86. Previously the support for pattern detection was added (seeTTI::isLegalAltInstr), but the cost still did not estimated properly.	2023-12-28 05:04:04 -08:00
Douglas Yung	fb981e6b4b	Revert "[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461 )" This reverts commit bc8c4bbd7973ab9527a78a20000aecde9bed652d. Change is failing to build on several bots: - https://lab.llvm.org/buildbot/#/builders/127/builds/60184 - https://lab.llvm.org/buildbot/#/builders/123/builds/23709 - https://lab.llvm.org/buildbot/#/builders/216/builds/32302	2023-12-27 23:52:04 -08:00
Alexey Bataev	bc8c4bbd79	[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461 ) SLP/TTI do not know about the cost estimation for addsub pattern, supported by X86. Previously the support for pattern detection was added (seeTTI::isLegalAltInstr), but the cost still did not estimated properly.	2023-12-27 15:57:21 -05:00
Paul Walker	930b5b52ff	[ConstantHoisting] Add a TTI hook to prevent hoisting. (#69004 ) Code generation can sometimes simplify expensive operations when an operand is constant. An example of this is divides on AArch64 where they can be rewritten using a cheaper sequence of multiplies and subtracts. Doing this is often better than hoisting expensive constants which are likely to be hoisted by MachineLICM anyway.	2023-12-13 17:20:36 +00:00
Philip Reames	e947f95337	[LSR][TTI][RISCV] Enable terminator folding for RISC-V If looking for a miscompile revert candidate, look here! The transform being enabled prefers comparing to a loop invariant exit value for a secondary IV over using an otherwise dead primary IV. This increases register pressure (by requiring the exit value to be live through the loop), but reduces the number of instructions within the loop by one. On RISC-V which has a large number of scalar registers, this is generally a profitable transform. We loose the ability to use a beqz on what is typically a count down IV, and pay the cost of computing the exit value on the secondary IV in the loop preheader, but save an add or sub in the loop body. For anything except an extremely short running loop, or one with extreme register pressure, this is profitable. On spec2017, we see a 0.42% geomean improvement in dynamic icount, with no individual workload regressing by more than 0.25%. Code size wise, we trade a (possibly compressible) beqz and a (possibly compressible) addi for a uncompressible beq. We also add instructions in the preheader. Net result is a slight regression overall, but neutral or better inside the loop. Previous versions of this transform had numerous cornercase correctness bugs. All of them ones I can spot by inspection have been fixed, and I have run this through all of spec2017, but there may be further issues lurking. Adding uses to an IV is a fraught thing to do given poison semantics, so this transform is somewhat inherently risky. This patch is a reworked version of D134893 by @eop. That patch has been abandoned since May, so I picked it up, reworked it a bit, and am landing it.	2023-11-29 12:04:06 -08:00
Sander de Smalen	00a831421f	[AArch64][SME] Extend Inliner cost-model with custom penalty for calls. (#68416 ) This is a stacked PR following on from #68415 This patch has two purposes: (1) It tries to make inlining more likely when it can avoid a streaming-mode change. (2) It avoids inlining when inlining causes more streaming-mode changes. An example of (1) is: ``` void streaming_compatible_bar(void); void foo(void) __arm_streaming { /* other code / streaming_compatible_bar(); / other code / } void f(void) { foo(); // expensive streaming mode change } -> void f(void) { / other code / streaming_compatible_bar(); / other code */ } ``` where it wouldn't have inlined the function when foo would be a non-streaming function. An example of (2) is: ``` void streaming_bar(void) __arm_streaming; void foo(void) __arm_streaming { streaming_bar(); streaming_bar(); } void f(void) { foo(); // expensive streaming mode change } -> (do not inline into) void f(void) { streaming_bar(); // these are now two expensive streaming mode changes streaming_bar(); }```	2023-10-31 10:28:40 +00:00
Mingming Liu	aa6ee03709	[NFC][Inliner] Introduce another multiplier for cost benefit analysis and make multipliers overriddable in TargetTransformInfo. - The motivation is to expose tunable knobs to control the aggressiveness of inlines for different backend (e.g., machines with different icache size, and workload with different icache/itlb PMU counters). Tuning inline aggressiveness shows a small (~+0.3%) but stable improvement on workload/hardware that is more frontend bound. - Both multipliers could be overridden from command line. Reviewed By: kazu Differential Revision: https://reviews.llvm.org/D153154	2023-10-02 21:27:07 -07:00
David Green	12025cef3e	[CostModel] Use min/max intrinsics for vecreduce.min/max costs This changes the costmodelling of the vecreduce.min/max nodes to use the costs of the relevant min/max intrinsics instead of expanding them to compare and selects. The getMinMaxReductionCost have changed to take a Opcode for the relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are no longer needed. A follow up patch will add some basic fminimum/fmaximum costmodelling. Differential Revision: https://reviews.llvm.org/D153547	2023-07-04 15:02:30 +01:00
Luke Lau	a68dcd09e8	[TTI] Use users of GEP to guess access type in getGEPCost Currently getGEPCost uses the target type of the GEP as a heuristic for the type that will be accessed, to pass onto isLegalAddressingMode. Targets use this to work out if a GEP can then be folded into the load/store instruction that uses the GEP. For example, on RISC-V loads and stores can have an offset added to a base register folded into a single instruction, so the following GEP is free: %p = getelementptr i32, ptr %base, i32 42 ; getInstructionCost = 0 %x = load i32, ptr %p ; getInstructionCost = 1 ------------------------------------------------------------------------ lw t0, a0(42) However vector loads and stores cannot have an offset folded into them, so the following GEP is costed: %p = getelementptr <2 x i32>, ptr %base, i32 42 ; getInstructionCost = 1 %x = load <2 x i32>, ptr %p ; getInstructionCost = 1 ------------------------------------------------------------------------ addi a0, 42 vle32 v8, (a0) The issue arises whenever there is a mismatch between the target type of the GEP and the type that is actually accessed: %p = getelementptr i32, ptr %base, i32 42 ; getInstructionCost = 0 %x = load <2 x i32>, ptr %p ; getInstructionCost = 1 ------------------------------------------------------------------------ addi a0, 42 vle32 v8, (a0) Even though this GEP will result in an add instruction, because TTI thinks it's loading an i32, it will think it can be folded and not charge for it. The target type can become mismatched with the memory access during transformations, noticeably during SLP where a scalar base pointer will be reused to perform a vector load or store. This patch adds an optional AccessType argument to getGEPCost which allows the type of memory accessed by users to be passed in as a hint, so that we can more accurately determine if the GEP can be folded into its users. If AccessType is not provided, getGEPCost falls back to the old behaviour of using the PointeeType to guess the memory access type. This can be revisited in a later patch. Also for now, only GEPs with exactly one user use the access type hint. Whilst we could look through all users and use all access types to determine if we can fold the GEP, this patch avoids doing so to prevent O(N) behaviour. Differential Revision: https://reviews.llvm.org/D149889	2023-06-29 13:44:37 +01:00
Juan Manuel MARTINEZ CAAMAÑO	cc8a346e3f	[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (1/2) On AMDGPU, alloca instructions have penalty that can be avoided when SROA is applied after inlining. This patch introduces the default implementation of TargetTransformInfo::getCallerAllocaCost. Reviewed By: mtrofin Differential Revision: https://reviews.llvm.org/D149740	2023-06-29 09:49:16 +02:00
Matt Arsenault	12c12c5fe0	TTI: Add function to hasBranchDivergence It my be possible to contextually ignore divergence in a function if it's known to run single threaded.	2023-06-16 18:47:40 -04:00
Matt Arsenault	a09f79d227	TargetTransformInfo: Add addrspacesMayAlias For some reason we used to only handle address space aliasing through chaining a target specific AA pass. We need never-fail simple queries in order to lower memmove intrinsics based purely on the address spaces. I also think it would be better if BasicAA checked this, rather than relying on the target AA passes. Currently we go through the more expensive AA analyses before getting to the trivial address space checks.	2023-06-13 20:44:00 -04:00
Matt Arsenault	3c848194f2	CodeGen: Expand memory intrinsics in PreISelIntrinsicLowering Expand large or unknown size memory intrinsics into loops in the default lowering pipeline if the target doesn't have the corresponding libfunc. Previously AMDGPU had a custom pass which existed to call the expansion utilities. With a default no-libcall option, we can remove the libfunc checks in LoopIdiomRecognize for these, which never made any sense. This also provides a path to lifting the immarg restriction on llvm.memcpy.inline. There seems to be a bug where TLI reports functions as available if you use -march and not -mtriple.	2023-06-09 21:04:37 -04:00
Luke Lau	c27a0b21c5	[SLP][RISCV] Account for offset folding in getPointersChainCost For a GEP in a pointer chain, if: 1) a pointer chain is unit-strided 2) the base pointer wasn't folded and is sitting in a register somewhere 3) the distance between the GEP and the base pointer is small enough and can be folded into the addressing mode of the using load/store Then we can exclude that GEP from the total cost of the pointer chain, as it will likely be folded away. In order to check if 3) holds, we need to know the type of memory access being made by the users of the pointer chain. For that, we need to pass along a new argument to getPointersChainCost. (Using the source pointer type of the GEP isn't accurate, see https://reviews.llvm.org/D149889 for more details). Also note that 2) is currently an assumption, and could be modelled more accurately. This prevents some unprofitable cases from being SLP vectorized on RISC-V by making the scalar costs cheaper and closer to the actual codegen. For now the getPointersChainCost hook is duplicated for RISC-V to prevent disturbing other targets, but could be merged back in and shared with other targets in a following patch. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D149654	2023-05-22 13:55:30 +01:00
Yonghong Song	da816c2985	[TTI][BPF] Ensure ArgumentPromotion Not Exceeding Target MaxArgs With LLVM patch https://reviews.llvm.org/D148269, we hit a linux kernel bpf selftest compilation failure like below: ... progs/test_xdp_noinline.c:739:8: error: too many args to t8: i64 = GlobalAddress<ptr @encap_v4> 0, progs/test_xdp_noinline.c:739:8 if (!encap_v4(xdp, cval, &pckt, dst, pkt_bytes)) ^ ... progs/test_xdp_noinline.c:321:6: error: defined with too many args bool encap_v4(struct xdp_md xdp, struct ctl_value cval, ^ ... Note that bpf selftests are compiled with -O2 which is the recommended flag for bpf community. The bpf backend calling convention is only allowing 5 parameters in registers and does not allow pass arguments through stacks. In the above case, ArgumentPromotionPass replaced parameter '&pckt' as two parameters, so the total number of arguments after ArgumentPromotion pass becomes 6 and this caused later compilation failure during instruction selection phase. This patch added a TargetTransformInfo hook getMaxNumArgs() which returns 5 for BPF and UINT_MAX for other targets. Differential Revision: https://reviews.llvm.org/D148551	2023-04-19 09:09:20 -07:00

1 2 3 4 5 ...

497 Commits