llvm-project

Author	SHA1	Message	Date
Florian Hahn	8098f2577e	[LV] Use Legal::isUniform to detect uniform pointers. Update collectLoopUniforms to identify uniform pointers using Legal::isUniform. This is more powerful and brings pointer classification here in sync with setCostBasedWideningDecision which uses isUniformMemOp. The existing mismatch in reasoning can cause crashes due to D134460, which is fixed by this patch. Fixes https://github.com/llvm/llvm-project/issues/60831. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D150991	2023-05-30 16:42:55 +01:00
Florian Hahn	1a28b9bce7	[VPlan] Handle invariant GEPs in isUniformAfterVectorization. This fixes a crash caused by legal treating a scalable GEP as invariant, but isUniformAfterVectorization does not handle GEPs. Partially fixes https://github.com/llvm/llvm-project/issues/60831. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D144434	2023-05-30 15:53:26 +01:00
Florian Hahn	b750862107	[LV] Use early exit for stores storing the ptr operand. (NFC) Cleanup suggested in D150991.	2023-05-30 12:14:12 +01:00
Kazu Hirata	75c75215e3	[Vectorize] Remove unused declaration requiresTooManyRuntimeChecks The corresponding function definition was removed by: commit 644a965c1efef68f22d9495e4cefbb599c214788 Author: Florian Hahn <flo@fhahn.com> Date: Mon Jul 4 15:10:48 2022 +0100	2023-05-29 11:56:50 -07:00
Justin Lebar	420cf6927c	[LSV] Return same bitwidth from getConstantOffset. Previously, getConstantOffset could return an APInt with a different bitwidth than the input pointers. For example, we might be loading an opaque 64-bit pointer, but stripAndAccumulateInBoundsConstantOffsets might give a 32-bit offset. This was OK in most cases because in gatherChains, we cast the APInt back to the original ASPtrBits. But it was not OK when considering selects. We'd call getConstantOffset twice and compare the resulting APInts, which might not have the same bit width. This fixes that. Now getConstantOffset always returns offsets with the correct width, so we don't need the hack of casting it in gatherChains, and it works correctly when we're handling selects. Differential Revision: https://reviews.llvm.org/D151640	2023-05-29 08:43:47 -07:00
Justin Lebar	f225471c68	[LSV] Fix the ContextInst for computeKnownBits. Previously we used the later of GEPA or GEPB. This is hacky because really we should be using the later of the two load/store instructions being considered. But also it's flat-out incorrect, because GEPA and GEPB might be in different BBs, in which case we cannot ask which one comes last (assertion failure, https://reviews.llvm.org/D149893#4378332). Fixed, now we use the correct context instruction. Differential Revision: https://reviews.llvm.org/D151630	2023-05-28 08:00:52 -07:00
Kazu Hirata	b1b04ed96a	[Vectorize] Fix warnings This patch fixes: llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:140:20: error: unused function 'operator<<' [-Werror,-Wunused-function] llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:176:6: error: unused function 'dumpChain' [-Werror,-Wunused-function]	2023-05-26 17:27:25 -07:00
Kazu Hirata	0508ac32cf	[Vectorize] Fix a warning This patch fixes: llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:1429:23: error: comparison of integers of different signs: 'int' and 'const size_t' (aka 'const unsigned long') [-Werror,-Wsign-compare]	2023-05-26 17:02:34 -07:00
Justin Lebar	8d57b00f96	Fix -Wsign-compare from D149893.	2023-05-26 16:22:16 -07:00
Justin Lebar	2be0abb7fe	Rewrite load-store-vectorizer. The motivation for this change is a workload generated by the XLA compiler targeting NVIDIA GPUs. This kernel has a few hundred i8 loads and stores. Merging is critical for performance. The current LSV doesn't merge these well because it only considers instructions within a block of 64 loads+stores. This limit is necessary to contain the O(n^2) behavior of the pass. I'm hesitant to increase the limit, because this pass is already one of the slowest parts of compiling an XLA program. So we rewrite basically the whole thing to use a new algorithm. Before, we compared every load/store to every other to see if they're consecutive. The insight (from tra@) is that this is redundant. If we know the offset from PtrA to PtrB, then we don't need to compare PtrC to both of them in order to tell whether C may be adjacent to A or B. So that's what we do. When scanning a basic block, we maintain a list of chains, where we know the offset from every element in the chain to the first element in the chain. Each instruction gets compared only to the leaders of all the chains. In the worst case, this is still O(n^2), because all chains might be of length 1. To prevent compile time blowup, we only consider the 64 most recently used chains. Thus we do no more comparisons than before, but we have the potential to make much longer chains. This rewrite affects many tests. The changes to tests fall into two categories. 1. The old code had what appears to be a bug when deciding whether a misaligned vectorized load is fast. Suppose TTI reports that load <4 x i32> align 4 has relative speed 1, and suppose that load i32 align 4 has relative speed 32. The intent of the code seems to be that we prefer the scalar load, because it's faster. But the old code would choose the vectorized load. 
accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not even call into TTI to get the relative speed), because the scalar load is aligned. After this patch, we will prefer the scalar load if it's faster. 2. This patch changes the logic for how we vectorize. Usually this results in vectorizing more. Explanation of changes to tests: - AMDGPU/adjust-alloca-alignment.ll: #1 - AMDGPU/flat_atomic.ll: #2, we vectorize more. - AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this. Before, we'd vectorize in case 1 and not case 2. Now we vectorize in case 2 and not case 1. So we just move the call. - AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more - AMDGPU/insertion-point.ll: #2 we vectorize more - AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1) - AMDGPU/multiple_tails.ll: #1 - AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above). - AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more. - NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above). - NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split) - X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic. - X86/subchain-interleaved.ll: #2, we vectorize more - X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float> - X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical. It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0. - X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll Differential Revision: https://reviews.llvm.org/D149893	2023-05-26 15:15:39 -07:00
Alexey Bataev	95b631181a	[SLP]Fix getSpillCost functions. There are several issues in the current implementation. The instructions are not properly ordered, if they are placed in different basic blocks, need to reverse the order of blocks. Also, need to exclude non-vectorizable nodes and check for CallBase, not CallInst, otherwise invoke calls are not handled correctly.	2023-05-26 12:19:28 -07:00
Alexander Timofeev	bad4de1ae7	Don't disable loop unroll for vectorized loops on AMDGPU target We've got a performance regression after https://reviews.llvm.org/D115261. Even though the loop is vectorized, unrolling is still required. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D149281	2023-05-25 22:54:41 +02:00
Craig Topper	6006d43e2d	LLVM_FALLTHROUGH => [[fallthrough]]. NFC Reviewed By: MaskRay Differential Revision: https://reviews.llvm.org/D150996	2023-05-24 12:40:10 -07:00
Florian Hahn	299f0ff60e	[VPlan] Print IR flags for VPRecipeWithIRFlags. Now that IR flags are modeled as part of VPRecipeWithIRFlags, include the flags when printing recipes. Depends on D150027. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D150029	2023-05-23 20:36:16 +01:00
Alexey Bataev	ae5ff3ca0c	[SLP]Fix PR62665: compiler crash when trying to access non-existing mask element. Need to check first whether the SubMask element is PoisonMaskElem to avoid a compiler crash.	2023-05-22 13:43:25 -07:00
Luke Lau	c27a0b21c5	[SLP][RISCV] Account for offset folding in getPointersChainCost For a GEP in a pointer chain, if: 1) a pointer chain is unit-strided 2) the base pointer wasn't folded and is sitting in a register somewhere 3) the distance between the GEP and the base pointer is small enough and can be folded into the addressing mode of the using load/store Then we can exclude that GEP from the total cost of the pointer chain, as it will likely be folded away. In order to check if 3) holds, we need to know the type of memory access being made by the users of the pointer chain. For that, we need to pass along a new argument to getPointersChainCost. (Using the source pointer type of the GEP isn't accurate, see https://reviews.llvm.org/D149889 for more details). Also note that 2) is currently an assumption, and could be modelled more accurately. This prevents some unprofitable cases from being SLP vectorized on RISC-V by making the scalar costs cheaper and closer to the actual codegen. For now the getPointersChainCost hook is duplicated for RISC-V to prevent disturbing other targets, but could be merged back in and shared with other targets in a following patch. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D149654	2023-05-22 13:55:30 +01:00
Florian Hahn	8eaf7a75fe	[VPlan] Add missing ifdef after 96686796f606. Fixes build with debug printing disabled.	2023-05-22 10:44:17 +01:00
Florian Hahn	96686796f6	[VPlan] Move live-out printing to VPLiveOut::print (NFC). Preparation for D150398. This brings live-out printing in line with how printing for recipes is handled.	2023-05-22 09:53:53 +01:00
Vasileios Porpodas	806dea46be	[SLP] Cleanup: Remove `tryToVectorizePair()`, most probably NFC `tryToVectorizePair()` adds a level of indirection over `tryToVectorizeList()`. I am not really sure why it is needed, it looks redundant. I replaced all calls to `tryToVectorizePair()` with calls to `tryToVectorizeList()` and I am not seeing any failures. Differential Revision: https://reviews.llvm.org/D151004	2023-05-19 20:25:20 -07:00
Vasileios Porpodas	338fc76200	[SLP][NFC] Cleanup: Remove KeyNodes set. I don't see a good reason for having the `KeyNodes` set. This patch removes the set. Differential Revision: https://reviews.llvm.org/D150918	2023-05-19 10:30:02 -07:00
Florian Hahn	55903151a2	[VPlan] Use isUniformAfterVec in VPReplicateRecipe::execute. I was unable to find a case where this actually changes generated code, but it enables the bug fix in D144434. It also brings codegen in line with the handling of stores to uniform addresses in the cost model (D134460). Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D144491	2023-05-19 18:15:21 +01:00
Alexey Bataev	9a7248f561	[SLP]Fix crash for scalarized vectors. Need to remove insertion of the nodes to the InVector in case of scalarized vectors too to avoid compiler crashes.	2023-05-17 06:32:22 -07:00
Luke Lau	d088b8af93	[SLP] Rename IsUniformStride to IsUnitStride. NFCI IsUniformStride is only used when the stride is a unit-stride, i.e. in a plain wide vector load. This tightens the condition and renames it to isUnitStride. It removes the old unused getUniformStrided() variant, as isUnitStride should now imply that the stride is known. Reviewed By: vdmitrie, ABataev Differential Revision: https://reviews.llvm.org/D150662	2023-05-17 13:21:33 +01:00
Alexey Bataev	6c7acc6409	[SLP][NFC]Add missing finalize params in the CostEstimator, NFC. Prepare functions for generalization of codegen/cost estimation. Differential Revision: https://reviews.llvm.org/D150121	2023-05-15 11:17:37 -07:00
Vasileios Porpodas	ddb2188afc	[SLP][NFC] Cleanup: Separate vectorization of Inserts and CmpInsts. This deprecates `vectorizeSimpleInstructions()` and replaces it with separate functions that vectorize CmpInsts and Inserts. Differential Revision: https://reviews.llvm.org/D149993	2023-05-15 10:12:34 -07:00
Florian Hahn	701f7230cd	[VPlan] Use VPRecipeWithIRFlags for VPReplicateRecipe, retire poison map Update VPReplicateRecipe to use VPRecipeWithIRFlags for IR flag handling. Retire separate MayGeneratePoisonRecipes map. Depends on D149082. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D150027	2023-05-15 11:49:20 +01:00
Florian Hahn	f40a7901d1	[LV] Move selecting vectorization factor logic to LVP (NFC). Split off from D143938. This moves the planning logic to select the vectorization factor to LoopVectorizationPlanner as a step towards only computing costs for individual VFs in LoopVectorizationCostModel and do planning in LVP. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D150197	2023-05-13 12:28:14 +01:00
Florian Hahn	7472f1da96	[VPlan] Change LoopVectorizationPlanner::TTI to be const reference (NFC)	2023-05-13 12:27:57 +01:00
Florian Hahn	0418d0242b	[LV] Move getVScaleForTuning out of LoopVectorizationCostModel (NFC). Split off refactoring from D150197 to reduce diff.	2023-05-13 10:17:13 +01:00
Philip Reames	592199c8fe	[LV] Use interface routines instead of internal variables This makes a (possible) change to the internal representation easier in the future, and makes the code easier to read now.	2023-05-12 16:27:12 -07:00
Florian Hahn	bf279a0f8e	[VPlan] Remove dangling comment and newlines (NFC). Apply missed cleanups.	2023-05-11 22:06:56 +01:00
Florian Hahn	3d4eed0133	[LV] Reuse SCEV expansion results for epilogue vectorization. When generating code for the epilogue vector loop, we need to re-use the expansion results for induction steps generated for the main vector loop, as the pre-header of the epilogue vector loop may not dominate the vector preheader of the epilogue. This fixes a reported crash. Note that this is a workaround which should be removed soon once induction resume value creation is handled in VPlan directly.	2023-05-11 22:00:07 +01:00
Philip Reames	7fbfcc653f	[LV/LAA] Use PSE to identify stride multiplies which simplify [mostly nfc] LV/LAA will speculate that (some) strided access patterns have unit stride, and insert runtime checks if required. LV cost models a multiply by such a stride as free. We did this by keeping around the StrideSet structure, just to check if one of the operands was one of the strides we speculated. We can instead just ask PredicatedScalarEvolution if either of the operands is one (after predicates are applied). We get mostly the same result - PSE can prove it in more cases in theory - and simpler code.	2023-05-11 11:16:04 -07:00
Philip Reames	e41dce4d49	[LAA/LV] Simplify stride speculation logic [NFC] (try 2) The original commit wasn't quite NFC, and this was caught by an arguably overly strong assert. Specifically, I'd failed to strip the integer cast off the SCEV before saving it in the map. The result - other than a failed assert - is that we'd speculate on the casted unknown, not the unknown. The only case I can think of where that might change behavior would be a sext(i1 load). I doubt that case is interesting in practice, but it's good to be strictly NFC on this change regardless. Original commit message follows. The existing code makes it hard to tell that collectStridedAccess is really about identifying some loop invariant SCEV which is profitable to speculate is equal to one. The odd dual usage structure of Value and SCEV confuses this point. We could choose to loosen the profitability analysis if desired. I'm not proposing doing so at this time as it exposes too many cases where the speculation is unprofitable. Differential Revision: https://reviews.llvm.org/D147750	2023-05-11 10:19:23 -07:00
Philip Reames	dc0d00c5fc	Revert "[LAA/LV] Simplify stride speculation logic [NFC]" This reverts commit d5b840131223f2ffef4e48ca769ad1eb7bb1869a. Running this through broader testing after rebasing is revealing a crash. Reverting while I investigate.	2023-05-11 09:26:35 -07:00
Florian Hahn	236a0e82df	[LV] Use VPValue to get expanded value for SCEV step expressions. Update skeleton creation logic to use SCEV expansion results from expanding the pre-header. This avoids another set of SCEV expansions that may happen after the CFG has been modified. Fixes #58811. Depends on D147964. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D147965	2023-05-11 16:49:19 +01:00
Philip Reames	d5b8401312	[LAA/LV] Simplify stride speculation logic [NFC] The existing code makes it hard to tell that collectStridedAccess is really about identifying some loop invariant SCEV which is profitable to speculate is equal to one. The odd dual usage structure of Value and SCEV confuses this point. We could choose to loosen the profitability analysis if desired. I'm not proposing doing so at this time as it exposes too many cases where the speculation is unprofitable. Differential Revision: https://reviews.llvm.org/D147750	2023-05-11 08:32:56 -07:00
Hongtao Yu	9272d0f079	[PseudoProbe] Clean up dwarf discriminator and avoid duplicating factor. A pseudo probe is created with dwarf line information shared with its nearest instruction. If the instruction comes with a dwarf discriminator, it will be shared with the probe as well. This can confuse the later FS-AFDO discriminator assignment pass. To fix this, I'm cleaning up the discriminator fields for probes when they are inserted. I also noticed another possibility to change the discriminator field of pseudo probes in the pipeline before the FS discriminator assignment pass. That is the loop unroller, which assigns a duplication factor to instructions being vectorized. I'm disabling that for pseudo probe intrinsics specifically, also for callsites with probes. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D148569	2023-05-10 11:26:23 -07:00
Vasileios Porpodas	dda2a5d457	[SLP][NFC] Rename a couple of variables and replace an if-else with an std::min - Rename `LimitForRegisterSize` to `MaxVFOnly` to make the meaning of the limit less ambiguous - Rename `OpsWidth` to `ActualVF`, which makes it clear that this is the VF we are using for vectorization. - Replace the if-else code for the initialization of OpsWidth with an std::min. Differential Revision: https://reviews.llvm.org/D150241	2023-05-10 09:37:58 -07:00
Florian Hahn	c096e91735	[VPlan] Address missed suggestions from D149082. This addresses 2 comments missed from D149082. It sets inbounds directly when creating the GEP and fixes the order in the enum.	2023-05-09 15:17:20 +01:00
Florian Hahn	5f3343985b	[VPlan] Use VPRecipeWithIRFlags for VPWidenGEPRecipe (NFCI). Extend VPRecipeWithIRFlags to also include InBounds and use for VPWidenGEPRecipe. The last remaining recipe that needs updating for MayGeneratePoisonRecipes is VPReplicateRecipe. Depends on D149081. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D149082	2023-05-09 12:33:28 +01:00
Florian Hahn	127b00b25c	[VPlan] Record IR flags on VPWidenRecipe directly (NFC). This patch introduces a VPRecipeWithIRFlags class to record various IR flags for a recipe. This allows de-coupling of IR flags from the underlying instructions. The main benefit is that it allows dropping of IR flags from recipes directly, without the need to go through State::MayGeneratePoisonRecipes. The plan is to remove MayGeneratePoisonRecipes once all relevant recipes are transitioned. It also allows dropping IR flags during VPlan-to-VPlan transforms, which will be used in a follow-up patch to implement truncateToMinimalBitwidths as VPlan-to-VPlan transform. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D149079	2023-05-08 17:28:50 +01:00
Florian Hahn	823d35fd3b	[VPlan] Use RecipeBuilder to look up member when fixing IG (NFC). Recipes for interleave group members are recorded directly in the RecipeBuilder. Use it directly instead of going indirectly through VPlan's Value->VPValue mapping.	2023-05-07 18:02:27 +01:00
Florian Hahn	7b7be685d4	[VPlan] Use operands directly in VPInstructionsToVPRecipes (NFC). Now that def-use chains are modeled directly in VPlan, we can simply use the operands of the recipe we are replacing. There is no need to use the operands of the underlying instruction to look up a VPValue.	2023-05-06 12:36:00 +01:00
Florian Hahn	01fa764c9a	[VPlan] Assert instead of check if VF is vector when widening GEPs (NFC) VPWidenGEPRecipe should not be generated for scalar VFs. Replace check with an assert.	2023-05-06 09:25:56 +01:00
Kazu Hirata	2b60bd5141	[Vectorize] Use DenseMap::contains (NFC)	2023-05-06 00:02:54 -07:00
Alexey Bataev	2672c6e4dc	[SLP][NFC]Add processBuildVector member function, NFC. Introduce processBuildVector as a next step to generalize code for cost estimation and code emission for gather/buildvector nodes. Differential Revision: https://reviews.llvm.org/D149973	2023-05-05 11:00:53 -07:00
Florian Hahn	8bd02e5aef	[VPlan] Assert instead of checking if VF is vec when widening calls (NFC) VPWidenCallRecipe should not be generated for scalar VFs. Replace check with an assert.	2023-05-05 18:21:57 +01:00
Vasileios Porpodas	7749f6e976	[SLP][NFC] Cleanup: Outline the code that vectorizes CmpInsts into a separate function. Differential Revision: https://reviews.llvm.org/D149919	2023-05-05 09:56:41 -07:00
Alexey Bataev	ca3f4236e4	[SLP][NFC]Add/use gather and createFreeze member functions in ShuffleInstructionBuilder, NFC.	2023-05-05 09:12:54 -07:00
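
Note on 2be0abb7fe (Rewrite load-store-vectorizer): the commit message describes comparing each load/store only against the leaders of existing chains and keeping only the 64 most recently used chains. The following is a minimal Python sketch of that idea, not the LLVM implementation. The function name `gather_chains`, the `(name, base, address)` tuples, and the rule "two accesses have a known offset only if they share a base" are all assumptions made for illustration.

```python
def gather_chains(accesses, max_chains=64):
    """Group accesses into chains by comparing each one only to chain leaders.

    accesses: iterable of (name, base, address) tuples (hypothetical model).
    Assumption: the offset between two accesses is known only when they
    share the same base, standing in for the pass's constant-offset query.
    Returns (chains, comparisons); each chain is a list of
    (name, offset_from_leader) pairs.
    """
    active = []       # chains still open for extension, most recently used first
    retired = []      # chains evicted once more than max_chains were active
    comparisons = 0

    for name, base, address in accesses:
        for i, chain in enumerate(active):
            comparisons += 1
            leader_base, leader_address = chain["leader"]
            if leader_base == base:
                # Offset to the leader is known, so offsets to every other
                # member of the chain follow without further comparisons.
                chain["members"].append((name, address - leader_address))
                active.insert(0, active.pop(i))  # mark as most recently used
                break
        else:
            # No leader matched: this access starts a new chain.
            active.insert(0, {"leader": (base, address), "members": [(name, 0)]})
            if len(active) > max_chains:
                retired.append(active.pop())     # drop the least recently used

    return [c["members"] for c in retired + active], comparisons


example = [("a0", "A", 0), ("b0", "B", 0), ("a1", "A", 4),
           ("b1", "B", 8), ("a2", "A", 8)]
chains, comparisons = gather_chains(example)
```

In this sketch the number of comparisons per access is bounded by `max_chains`, while a chain itself can grow without limit, which mirrors the trade-off the commit message describes.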

3803 Commits