llvm-project

Author	SHA1	Message	Date
Jay Foad	7b3bbd83c0	Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038 )" This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c. Reverted due to various buildbot failures.	2023-10-09 12:31:32 +01:00
Jay Foad	2501ae58e3	[CodeGen] Really renumber slot indexes before register allocation (#67038 ) PR #66334 tried to renumber slot indexes before register allocation, but the numbering was still affected by list entries for instructions which had been erased. Fix this to make the register allocator's live range length heuristics even less dependent on the history of how instructions have been added to and removed from SlotIndexes's maps.	2023-10-09 11:44:41 +01:00
Simon Pilgrim	943fda567a	[X86] matchTruncateWithPACK - canonically prefer v4i64 -> v4i32 shuffle vs truncation Pulled out of LowerTruncateVecPackWithSignBits - prefer shuffles unless we can cheaply split the vector. ComputeNumSignBits struggles with vXi64 through bitcasts, so we're usually better off with shuffles.	2023-08-08 10:05:24 +01:00
Simon Pilgrim	071671e15c	[X86] Allow pre-SSE41 targets to extract multiple v16i8 elements coming from the same DWORD/WORD super-element Pre-SSE41 targets tended to have weak (serial) GPR<->VEC moves, meaning we only allowed a single v16i8 extraction before spilling the vector to stack and loading the i8 elements instead. But this didn't make use of the DWORD/WORD extraction we had to use could extract multiple i8 elements at the same time. This patch attempts to determine if all uses of a vector are element extractions, and works out whether all the extractions share the same WORD or (lowest) DWORD, in which case we can perform a single extraction and just shift/truncate the individual elements. Differential Revision: https://reviews.llvm.org/D156350	2023-07-31 17:08:34 +01:00
Simon Pilgrim	842a6728d9	[X86] LowerTRUNCATE - improve handling during type legalization to PACKSS/PACKUS patterns Extend coverage for lowering wide vector types during type legalization to allow us to use PACKSS/PACKUS patterns instead of dropping down to shuffle lowering. First step towards avoiding premature folds of TRUNCATE to PACKSS/PACKUS nodes as described on Issue #63710 - which causes a large number of regressions on D152928 - we will next need to tweak the TRUNCATE widening in ReplaceNodeResults Differential Revision: https://reviews.llvm.org/D154592	2023-07-11 10:39:44 +01:00
Simon Pilgrim	3f7470c33d	[X86] Fold BITOP(PACKSS(X,Z),PACKSS(Y,W)) --> PACKSS(BITOP(X,Y),BITOP(Z,W)) (REAPPLIED) Fold allsignbits pack patterns to make better use of cheap (and commutable) logic ops Reapplied after a32d14fd4c0a / 156913cb7764 with bitcast fix	2023-07-06 10:56:07 +01:00
Arthur Eubanks	156913cb77	Revert "[X86] Fold BITOP(PACKSS(X,Z),PACKSS(Y,W)) --> PACKSS(BITOP(X,Y),BITOP(Z,W))" This reverts commit a32d14fd4c0a43c154f251df1ccfe57e8b0a711a. Causes crashes, see https://reviews.llvm.org/rGa32d14fd4c0a43c154f251df1ccfe57e8b0a711a.	2023-07-05 14:52:57 -07:00
Simon Pilgrim	a32d14fd4c	[X86] Fold BITOP(PACKSS(X,Z),PACKSS(Y,W)) --> PACKSS(BITOP(X,Y),BITOP(Z,W)) Fold allsignbits pack patterns to make better use of cheap (and commutable) logic ops	2023-07-05 16:43:43 +01:00
Simon Pilgrim	f6ff2cc7e0	[X86] X86FixupVectorConstantsPass - attempt to replace full width integer vector constant loads with broadcasts on AVX2+ targets (REAPPLIED) lowerBuildVectorAsBroadcast will not broadcast splat constants in all cases, resulting in a lot of situations where a full width vector load that has failed to fold but is loading splat constant values could use a broadcast load instruction just as cheaply, and save constant pool space. This is an updated commit of ab4b924832ce26c21b88d7f82fcf4992ea8906bb after being reverted at 78de45fd4a902066617fcc9bb88efee11f743bc6	2023-06-14 12:48:33 +01:00
Amaury Séchet	a70d5e25f3	[DAGCombine] Make sure combined nodes are added back to the worklist in topological order. Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D127115	2023-06-13 09:14:37 +00:00
Simon Pilgrim	78de45fd4a	Revert rGab4b924832ce26c21b88d7f82fcf4992ea8906bb - [X86] X86FixupVectorConstantsPass - attempt to replace full width integer vector constant loads with broadcasts on AVX2+ targets Reverting while we address an existing issue exposed by this (Issue #63108)	2023-06-06 18:07:33 +01:00
JP Lehr	c9998ec145	Revert "[DAGCombine] Make sure combined nodes are added back to the worklist in topological order." This reverts commit e69fa03ddd85812be3143d79a0359c3e8d43bd45. This patch lead to build time outs on the AMDGPU OpenMP runtime buildbot.	2023-06-05 10:55:58 -04:00
Amaury Séchet	e69fa03ddd	[DAGCombine] Make sure combined nodes are added back to the worklist in topological order. Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D127115	2023-06-05 11:09:18 +00:00
Simon Pilgrim	ab4b924832	[X86] X86FixupVectorConstantsPass - attempt to replace full width integer vector constant loads with broadcasts on AVX2+ targets lowerBuildVectorAsBroadcast will not broadcast splat constants in all cases, resulting in a lot of situations where a full width vector load that has failed to fold but is loading splat constant values could use a broadcast load instruction just as cheaply, and save constant pool space.	2023-05-30 13:17:26 +01:00
Simon Pilgrim	0b91de5ea3	[X86] Add X86FixupVectorConstantsPass to re-fold AVX512 vector load folds as broadcast folds This patch analyzes AVX512 instructions for full vector width folded loads from the constant pool and attempts to determine if it can be replaced with a smaller broadcast folded variant. Typically the broadcast opportunities were missed by type-width mismatches or mulituse limitations which have been removed in later passes. As well as introducing broadcast fold tables (which can hopefully be extended/automated in the future), this also handles mismatches in the AND/ANDN/OR/XOR/TERNLOG type-widths, catching additional missed opportunities. This is patch is pulled from the ongoing work based on D150143, but without removing the existing DAG constant broadcast lowering code - this patch is currently a late stage cleanup only. The intention is to add additional broadcast/extension handling of constants in future patches, but it turned out that AVX512 broadcast handling was the easiest to start with. Differential Revision: https://reviews.llvm.org/D150526	2023-05-23 10:58:17 +01:00
Simon Pilgrim	3a0c1d5ab9	[X86] combineSetCCMOVMSK - fold anyof/noneof movmskps/movmskpd -> testps/testpd Another part of Issue #60007	2023-04-14 15:37:40 +01:00
Simon Pilgrim	d891968f10	[X86] combineMOVMSK - merge movmsk(icmp_eq(and(x,c1),c1)) and movmsk(icmp_eq(and(x,c1),0)) folds Use the same value tracking implementation for both, removing hardcoded PCMPEQ(AND(X,C),C) pattern so to handle bitcasted logic/constants.	2023-04-03 17:17:52 +01:00
Simon Pilgrim	6865cff8ea	[X86] combineMOVMSK - fold movmsk(icmp_eq(and(x,c1),c1)) -> movmsk(shl(x,c2)) iff pow2splat(c1) We already have a similar fold for movmsk(icmp_eq(and(x,c1),0)) which we can probably merge this with, but it will involve generalizing a lot of the knownbits code	2023-04-03 15:11:13 +01:00
Simon Pilgrim	39d7bf6e34	[X86] MatchVectorAllZeroTest - handle icmp_eq(bitcast(vXi1 trunc(Y)),0) style reduction patterns If we've truncated from a wider vector, then perform the all vector comparison on that with a suitable mask There's a minor pre-SSE41 regression due to a missing movmsk(icmp_eq(and(x,c1pow2),c1pow2)) -> movmsk(shl(x,c2)) fold that will be addressed in a followup commit	2023-04-03 15:11:13 +01:00
Simon Pilgrim	82df745e57	[X86] Test coverage for icmp_eq(bitcast(vXi1 trunc(Y)),0) style reduction patterns Add SSE41 test coverage as well	2023-04-03 13:16:59 +01:00
Simon Pilgrim	37c3b83bd8	[X86] combineBitcastvxi1 - handle boolmask sign-extension through vselect See if we can freely sign-extend both sources of a vselect operand, also handle allones constant build vectors (easily rematerializable and uses in the test case). Fixes #59526	2022-12-15 16:40:44 +00:00
Simon Pilgrim	4da6a983ad	[X86] Add test case for Issue #59526	2022-12-15 16:19:41 +00:00
Simon Pilgrim	4e8f847676	[X86][AVX512] Fold extract_element(bitcast(<X x i1>) -> bitcast(extract_subvector()) On AVX512, extract legal bool vectors as bool subvectors before bitcasting to scalars to avoid spilling to stack. This helps rust which internally represents bool vectors as bool arrays It also exposes more missed opportunities to use the KADD instruction to add masks together before moving to gpr Fixes #58546	2022-10-23 14:47:24 +01:00
Simon Pilgrim	c7fb1cdcba	[X86] Add test case for Issue #58546	2022-10-23 14:27:37 +01:00
Sanjay Patel	f0dd12ec5c	[x86] use zero-extending load of a byte outside of loops too (2nd try) The first attempt missed changing test files for tools (update_llc_test_checks.py). Original commit message: This implements the main suggested change from issue #56498. Using the shorter (non-extending) instruction with only -Oz ("minsize") rather than -Os ("optsize") is left as a possible follow-up. As noted in the bug report, the zero-extending load may have shorter latency/better throughput across a wide range of x86 micro-arches, and it avoids a potential false dependency. The cost is an extra instruction byte. This could cause perf ups and downs from secondary effects, but I don't think it is possible to account for those in advance, and that will likely also depend on exact micro-arch. This does bring LLVM x86 codegen more in line with existing gcc codegen, so if problems are exposed they are more likely to occur for both compilers. Differential Revision: https://reviews.llvm.org/D129775	2022-07-19 21:27:08 -04:00
Sanjay Patel	95401b0153	Revert "[x86] use zero-extending load of a byte outside of loops too" This reverts commit 9d1ea1774c51c44ddf0b5065bf600919988d7015. There are tests of update_llc_tests_checks.py that missed being updated.	2022-07-19 17:37:22 -04:00
Sanjay Patel	9d1ea1774c	[x86] use zero-extending load of a byte outside of loops too This implements the main suggested change from issue #56498. Using the shorter (non-extending) instruction with only -Oz ("minsize") rather than -Os ("optsize") is left as a possible follow-up. As noted in the bug report, the zero-extending load may have shorter latency/better throughput across a wide range of x86 micro-arches, and it avoids a potential false dependency. The cost is an extra instruction byte. This could cause perf ups and downs from secondary effects, but I don't think it is possible to account for those in advance, and that will likely also depend on exact micro-arch. This does bring LLVM x86 codegen more in line with existing gcc codegen, so if problems are exposed they are more likely to occur for both compilers. Differential Revision: https://reviews.llvm.org/D129775	2022-07-19 16:43:47 -04:00
Sanjay Patel	c486b82cfb	[x86] try harder to scalarize a vector load with extracted integer op uses This is a retry of b4b97ec813a0 - that was reverted because it could cause miscompiles by illegally reordering memory operations. A new test based on #53695 is added here to verify we do not have that same problem. extract_vec_elt (load X), C --> scalar load (X+C) As noted in the comment, DAGCombiner has this fold -- and the code in this patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but x86 should benefit even if the loaded vector has other uses as long as we apply some other x86-specific conditions. The motivating example from #50310 is shown in vec_int_to_fp.ll. Fixes #50310 Fixes #53695 Differential Revision: https://reviews.llvm.org/D118376	2022-02-13 08:32:21 -05:00
Sanjay Patel	7b03725097	Revert "[x86] try harder to scalarize a vector load with extracted integer op uses" This reverts commit b4b97ec813a02585000f30ac7d532dda74e8bfda. As discussed in post-commit feedback at: https://reviews.llvm.org/D118376 ...there's a stage 2 failure on a Mac running a clang-refactor tool test.	2022-02-04 07:45:57 -05:00
Sanjay Patel	b4b97ec813	[x86] try harder to scalarize a vector load with extracted integer op uses extract_vec_elt (load X), C --> scalar load (X+C) As noted in the comment, DAGCombiner has this fold -- and the code in this patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but x86 should benefit even if the loaded vector has other uses as long as we apply some other x86-specific conditions. The motivating example from #50310 is shown in vec_int_to_fp.ll. Fixes #50310 Differential Revision: https://reviews.llvm.org/D118376	2022-01-28 10:22:52 -05:00
Eli Friedman	bdd55b2f18	Fix the default alignment of i1 vectors. Currently, the default alignment is much larger than the actual size of the vector in memory. Fix this to use a sane default. For SVE, temporarily remove lowering of load/store operations for predicates with less than 16 elements. The layout the backend was assuming for SVE predicates with less than 16 elements doesn't agree with the frontend. More work probably needs to be done here. This change is, strictly speaking, not backwards-compatible at the bitcode level. But probably nobody is actually depending on that; i1 vectors in memory are rare, and the code that does use them probably ends up forcing the alignment to something sane anyway. If we think this is a concern, I can restrict this to scalable vectors for now (where it's actually causing issues for me at the moment). Differential Revision: https://reviews.llvm.org/D88994	2021-07-31 14:09:59 -07:00
Wang, Pengfei	07ba0662da	[CodeGen][X86] Remove unused check-prefixes from bitcast tests. NFCI.	2020-11-11 09:16:57 +08:00
Simon Pilgrim	0c005be6eb	[X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast. getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements. I'm still investigating what we should do more generally to avoid these undemanded elements, but broadcast cases was a simpler win.	2020-07-29 11:32:44 +01:00
Simon Pilgrim	71f342d6c3	[X86][AVX] Fold PACK(LOSUBVECTOR(SHUFFLE(X)),HISUBVECTOR(SHUFFLE(X))) -> SHUFFLE(PACK(LOSUBVECTOR(X),HISUBVECTOR(X))) Using PACK for truncations leaves us with intermediate shuffles that can be tricky to remove while the truncation tree is being formed. This fold helps pull out the PERMQ case which is one of the most common, avoiding some costly lane-crossing shuffles. A future patch will begin adding more general shuffle folding, which we should be able to use for HADD/HSUB as well.	2020-07-04 13:54:30 +01:00
Craig Topper	d3c79d1953	[X86] Add an AVX check prefix to bitcast-vector-bool.ll to combine checks where AVX1/2/512 are all the same. NFC	2020-06-21 20:30:17 -07:00
Craig Topper	cb5072d187	[X86] Teach combineBitcastvxi1 to prefer movmsk on avx512 in more cases If the input to the bitcast is a sign bit test, it makes sense to directly use vpmovmskb or vmovmskps/pd. This removes the need to copy the sign bits to a k-register and then to a GPR. Fixes PR46200. Differential Revision: https://reviews.llvm.org/D81327	2020-06-13 14:50:13 -07:00
Simon Pilgrim	70293ba26f	[DAG] SimplifyMultipleUseDemandedBits - remove superfluous bitcasts If the SimplifyMultipleUseDemandedBits calls BITCASTs that peek through back to the original type then we can remove the BITCASTs entirely. Differential Revision: https://reviews.llvm.org/D79572	2020-05-08 19:04:49 +01:00
LemonBoy	6d103ca855	[SelectionDAG] Unify scalarizeVectorLoad and VectorLegalizer::ExpandLoad The two code paths have the same goal, legalizing a load of a non-byte-sized vector by loading the "flattened" representation in memory, slicing off each single element and then building a vector out of those pieces. The technique employed by `ExpandLoad` is slightly more convoluted and produces slightly better codegen on ARM, AMDGPU and x86 but suffers from some bugs (D78480) and is wrong for BE machines. Differential Revision: https://reviews.llvm.org/D79096	2020-05-02 15:18:10 -07:00
Simon Pilgrim	05c0d34918	[X86][SSE] Prefer trunc(movd(x)) to pextrb(x,0) If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB. The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3 tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4). This matches what we already do for PEXTRW. Differential Revision: https://reviews.llvm.org/D76138	2020-03-13 18:43:04 +00:00
Simon Pilgrim	0c78b64696	[X86][SSE] Add bitcast <128 x i1> %1 to <2 x i64> test case	2020-02-02 18:00:09 +00:00
Simon Pilgrim	17e91b7dd2	[X86][SSE] combineBitcastvxi1 - add pre-AVX512 v64i1 handling	2020-02-02 18:00:09 +00:00
Simon Pilgrim	e7e043724e	[DAG] Enable ISD::EXTRACT_SUBVECTOR SimplifyMultipleUseDemandedBits handling This allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits to create a simpler ISD::EXTRACT_SUBVECTOR, which is particularly useful for cases where we're splitting into subvectors anyhow. Differential Revision: This allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits to create a simpler ISD::EXTRACT_SUBVECTOR, which is particularly useful for cases where we're splitting into subvectors anyhow.	2020-01-27 21:17:47 +00:00
Simon Pilgrim	5340434c94	[X86][SSE] combineExtractWithShuffle - extract(bitcast(broadcast(x))) --> x Removes some unnecessary gpr<-->fpu traffic	2020-01-22 18:02:58 +00:00
Simon Pilgrim	a14aa7dabd	[X86][SSE] combineExtractWithShuffle - extract(bictcast(scalar_to_vector(x))) --> x Removes some unnecessary gpr<-->fpu traffic	2020-01-22 16:11:08 +00:00
Craig Topper	b9376690a0	[X86] Improve lowering of v2i64 sign bit tests on pre-sse4.2 targets Without sse4.2 a v2i64 setlt needs to expand into a pcmpgtd, pcmpeqd, 3 shuffles, and 2 logic ops. But if we're only interested in the sign bit of the i64 elements, we can just use one pcmpgtd and shuffle the odd elements to the even elements. Differential Revision: https://reviews.llvm.org/D72302	2020-01-07 11:22:03 -08:00
Simon Pilgrim	05e8209e33	[TargetLowering] SimplifyDemandedBits - call SimplifyMultipleUseDemandedBits for ISD::TRUNCATE llvm-svn: 368553	2019-08-12 10:56:05 +00:00
Craig Topper	8b5f2ab2a4	Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default." The assert that caused this to be reverted should be fixed now. Original commit message: This patch changes our defualt legalization behavior for 16, 32, and 64 bit vectors with i8/i16/i32/i64 scalar types from promotion to widening. For example, v8i8 will now be widened to v16i8 instead of promoted to v8i16. This keeps the elements widths the same and pads with undef elements. We believe this is a better legalization strategy. But it carries some issues due to the fragmented vector ISA. For example, i8 shifts and multiplies get widened and then later have to be promoted/split into vXi16 vectors. This has the potential to cause regressions so we wanted to get it in early in the 10.0 cycle so we have plenty of time to address them. Next steps will be to merge tests that explicitly test the command line option. And then we can remove the option and its associated code. llvm-svn: 368183	2019-08-07 16:24:26 +00:00
Mitch Phillips	bd0d97e1c4	Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default." This reverts commit 3de33245d2c992c9e0af60372043540b60f3a810. This commit broke the MSan buildbots. See https://reviews.llvm.org/rL367901 for more information. llvm-svn: 368107	2019-08-06 23:00:43 +00:00
Craig Topper	3de33245d2	[X86] Enable -x86-experimental-vector-widening-legalization by default. This patch changes our defualt legalization behavior for 16, 32, and 64 bit vectors with i8/i16/i32/i64 scalar types from promotion to widening. For example, v8i8 will now be widened to v16i8 instead of promoted to v8i16. This keeps the elements widths the same and pads with undef elements. We believe this is a better legalization strategy. But it carries some issues due to the fragmented vector ISA. For example, i8 shifts and multiplies get widened and then later have to be promoted/split into vXi16 vectors. This has the potential to cause regressions so we wanted to get it in early in the 10.0 cycle so we have plenty of time to address them. Next steps will be to merge tests that explicitly test the command line option. And then we can remove the option and its associated code. llvm-svn: 367901	2019-08-05 18:25:36 +00:00
Simon Pilgrim	436fd52a71	[X86] lowerShuffleAsSpecificZeroOrAnyExtend - use undef PSHUFB mask indices for ANY_EXTEND shuffles llvm-svn: 367784	2019-08-04 13:15:23 +00:00

1 2

58 Commits