llvm-project

Author	SHA1	Message	Date
shamithoke	e3ef4612c1	Perform bitreverse using AVX512 GFNI for i32 and i64. (#81764 ) Currently, the lowering operation for bitreverse using Intel AVX512 GFNI only supports byte vectors. Extend the operation to i32 and i64. --------- Co-authored-by: shami <shami_thoke@yahoo.com>	2024-04-10 20:22:44 +01:00
Noah Goldstein	6c40d463c2	[X86] Use `nneg` flag when trying to convert `uitofp` -> `sitofp` Closes #86694	2024-04-09 23:06:55 -05:00
Simon Pilgrim	4023329bbf	[X86] collectConcatOps - add ability to recurse through insert_subvector chains Allows us to match insert_subvector(insert_subvector(undef, insert_subvector(insert_subvector(undef, x, 0), y, 1), 0), 0), insert_subvector(insert_subvector(undef, z, 0), w, 1), 2)	2024-04-09 13:23:44 +01:00
Simon Pilgrim	0bbe953aa3	[X86] Fold extract_subvector(cvtps2dq(x),c) -> cvtps2dq(extract_subvector(x,c)) Help unblock #83402	2024-04-09 11:06:18 +01:00
Simon Pilgrim	170c525d79	[X86] combineExtractVectorElt - fold extract(trunc(x),c) -> trunc(extract(x,c))	2024-04-08 11:01:19 +01:00
Simon Pilgrim	2d0087424f	[DAG] Remove extract_vector_elt(freeze(x)), idx -> freeze(extract_vector_elt(x), idx) fold (#87480 ) Reverse the fold with handling inside canCreateUndefOrPoison for cases where we know that the extract index is in bounds. This exposed a number of regressions, and required some initial freeze handling of SCALAR_TO_VECTOR, which will require us to properly improve demandedelts support to handle its undef upper elements. There is still one outstanding regression to be addressed in the future - how do we want to handle folds involving frozen loads? Fixes #86968	2024-04-04 11:10:55 +01:00
Simon Pilgrim	8bc2d19c13	[X86] canonicalizeShuffleWithOp - don't fold VPERMI(BINOP(X,Y)) -> BINOP(VPERMI(X),VPERMI(Y)) VPERMI (VPERMQ/PD) is nearly always lane-crossing and poorly merges with target shuffles (other than itself). For now, I've restricted VPERMI to only merge with itself, constants, loads and splats. We might be able to merge with a few other special cases (AND/ANDNP with constant?), which could help the shuffle-vs-trunc-256.ll AVX512VL regression, but since that now gives similar codegen to the other AVX512 variants, I'd prefer to improve the shuffle lowering for that properly.	2024-04-02 18:38:37 +01:00
Simon Pilgrim	5b06de7f99	[X86] Add isLogicOp helper to match ISD::AND/OR/XOR and X86ISD::ANDNP We could easily support the X86ISD 'float' variants of the logic ops as well, but we don't have good test coverage at the moment (they're mainly for SSE1 targets).	2024-03-28 19:39:17 +00:00
Simon Pilgrim	dcd0f2b610	[X86] combineExtractFromVectorLoad support extraction from vector of different types to the extraction type/index combineExtractFromVectorLoad no longer uses the vector we're extracting from to determine the pointer offset calculation, allowing us to extract from types that have been bitcast to work with specific target shuffles. Fixes #85419	2024-03-27 17:01:41 +00:00
Simon Pilgrim	6d3ec56d3c	[X86] combineExtractWithShuffle - use combineExtractFromVectorLoad to extract scalar load from shuffled vector load Improves #85419	2024-03-27 14:54:25 +00:00
Simon Pilgrim	875aed17b9	[X86] Add combineExtractFromVectorLoad helper - pulled out of combineExtractVectorElt Prep work for #85419 to make it easier to reuse in other combines	2024-03-27 12:22:31 +00:00
Björn Pettersson	3e6e54eb79	[X86] Fix miscompile in combineShiftRightArithmetic (#86597 ) When folding (ashr (shl, x, c1), c2) we need to treat c1 and c2 as unsigned to find out if the combined shift should be a left or right shift. Also do an early out during pre-legalization in case c1 and c2 have different types, as that would otherwise complicate the comparison of c1 and c2 a bit.	2024-03-26 20:53:34 +01:00
Simon Pilgrim	d18bee2313	[X86] combineConcatVectorOps - concatenate FADD/FSUB/FMUL ops if we don't increase the number of INSERT_SUBVECTOR nodes. FADD/FSUB/FMUL are usually less port-bound than INSERT_SUBVECTOR, so only concatenate if it reduces the instruction count and doesn't introduce extra INSERT_SUBVECTOR nodes.	2024-03-26 15:03:41 +00:00
Phoebe Wang	2e4e04c590	[X86][BF16] Do not lower to VCVTNEPS2BF16 without AVX512VL (#86395 ) Fixes: #86305	2024-03-25 10:06:12 +08:00
Simon Pilgrim	7ccb31a5bc	[X86] splitVectorOp - share the same SDLoc argument instead of recreating it over and over again.	2024-03-20 11:21:09 +00:00
Jonas Paulsson	09bc6abba6	[MachineFrameInfo] Refactoring around computeMaxCallFrameSize() (NFC) (#78001 ) - Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code. - Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().	2024-03-18 10:37:59 -04:00
SahilPatidar	186565513c	[X86][AVX] Fix handling of out-of-bounds SRA shift amounts in AVX2 vector shift nodes (#84426 )	2024-03-15 16:34:33 +00:00
Simon Pilgrim	c957715d72	[X86] isGuaranteedNotToBeUndefOrPoisonForTargetNode - generalize shuffle decoding to support more target shuffles in the future.	2024-03-15 14:13:14 +00:00
Phoebe Wang	f4676b6be6	[X86] Add Support for X86 TLSDESC Relocations (#83136 )	2024-03-15 22:09:56 +08:00
Simon Pilgrim	2e271ceff6	Revert 4fef8c75abb080e6471395492819171fee8261fa "[X86] splitVectorOp - share the same SDLoc argument instead of recreating it over and over again." This appears to have broken the clang-with-thin-lto-ubuntu buildbot somehow (unconfirmed but it's a likely candidate)	2024-03-14 12:54:58 +00:00
Simon Pilgrim	4fef8c75ab	[X86] splitVectorOp - share the same SDLoc argument instead of recreating it over and over again.	2024-03-14 10:47:26 +00:00
Simon Pilgrim	c1af6ab505	[X86] getFauxShuffleMask - recognise CONCAT(SUB0, SUB1) style patterns Handles the INSERT_SUBVECTOR(INSERT_SUBVECTOR(UNDEF,SUB0,0),SUB1,N) pattern Currently limited to v8i64/v8f64 cases as only AVX512 has decent cross lane 2-input shuffles, the plan is to relax this as I deal with some regressions	2024-03-12 17:40:19 +00:00
Simon Pilgrim	683a9ac803	[X86] combineVectorPack - use APInt::truncSSat for PACKSS constant folding. NFC. Unfortunately PACKUS can't use APInt::truncUSat	2024-03-12 17:10:27 +00:00
Simon Pilgrim	ffe41819e5	[Support] Add KnownBits::abds signed absolute difference and rename absdiff -> abdu (#84897 ) When I created KnownBits::absdiff, I totally missed that we already have ISD::ABDS/ABDU nodes, and we use this term in other places/targets as well. I've added the KnownBits::abds implementation and renamed KnownBits::absdiff to KnownBits::abdu. Followup to #84791	2024-03-12 12:10:30 +00:00
Simon Pilgrim	7b90a67fe7	[X86] Assert that the supportedVectorShift* helpers are only called with generic shift opcodes. NFC.	2024-03-11 11:21:48 +00:00
Simon Pilgrim	862c7e0218	[X86] combineAndShuffleNot - ensure the type is legal before creating X86ISD::ANDNP target nodes Fixes #84660	2024-03-10 16:23:51 +00:00
Noah Goldstein	9f96db8e31	[X86] Fold `(icmp ult (add x,-C),2)` -> `(or (icmp eq X,C), (icmp eq X,C+1))` for Vectors This is undoing a middle-end transform which does the opposite. Since X86 doesn't have unsigned vector comparison instructions pre-AVX512, the simplified form gets worse codegen. Fixes #66479 Proofs: https://alive2.llvm.org/ce/z/UCz3wt Closes #84104 Closes #66479	2024-03-07 13:12:09 -06:00
Simon Pilgrim	0bd9255f8a	[X86] Improve KnownBits for X86ISD::PSADBW nodes (#83830 ) Don't just return the known zero upperbits, compute the absdiff Knownbits and perform the horizontal sum. Add implementations that handle both the X86ISD::PSADBW nodes and the INTRINSIC_WO_CHAIN intrinsics (pre-legalization).	2024-03-06 17:23:15 +00:00
Simon Pilgrim	5a896c66e3	[X86] Merge repeated getTargetLoweringInfo() calls. NFC.	2024-03-06 15:47:35 +00:00
Simon Pilgrim	f53c2f66a7	[X86] combineSetCC - use getZExtOrTrunc() to perform the constant folding. NFCI	2024-03-06 15:47:34 +00:00
Noah Goldstein	a4951eca40	Recommit "[X86] Don't always separate conditions in `(br (and/or cond0, cond1))` into separate branches" (2nd Try) Changes in Recommit: 1) Fix non-determinism by using `SmallMapVector` instead of `SmallPtrSet`. 2) Fix bug in dependency pruning where we discounted the actual `and/or` combining the two conditions. This led to over-pruning. Closes #81689	2024-03-04 13:23:56 -06:00
Shilei Tian	8300f30a92	[SelectionDAG] Add `STRICT_BF16_TO_FP` and `STRICT_FP_TO_BF16` (#80056 ) This patch adds the support for `STRICT_BF16_TO_FP` and `STRICT_FP_TO_BF16`.	2024-03-04 01:08:49 -05:00
Shilei Tian	2c5d01c2cf	Revert "[SelectionDAG] Add `STRICT_BF16_TO_FP` and `STRICT_FP_TO_BF16` (#80056 )" This reverts commit b0c158bd947c360a4652eb0de3a4794f46deb88b. The changes in `compiler-rt` broke tests.	2024-03-04 00:33:31 -05:00
Shilei Tian	b0c158bd94	[SelectionDAG] Add `STRICT_BF16_TO_FP` and `STRICT_FP_TO_BF16` (#80056 ) This patch adds the support for `STRICT_BF16_TO_FP` and `STRICT_FP_TO_BF16`.	2024-03-04 00:01:50 -05:00
NAKAMURA Takumi	5b4759f9fd	Revert "[X86] Don't always separate conditions in `(br (and/or cond0, cond1))` into separate branches" This has been buggy for a while. Reverts #81689 This reverts commit ae76dfb74701e05e5ab4be194e20e49f10768e46.	2024-03-03 22:31:28 +09:00
Simon Pilgrim	ca827d53c5	[X86] Convert logicalshift(x, C) -> and(x, M) iff x is allsignbits (#83596 ) If we're logical shifting an all-signbits value, then we can just mask out the shifted bits. This helps remove some unnecessary bitcasted vXi16 shifts used for vXi8 shifts (which SimplifyDemandedBits will struggle to remove through the bitcast), and allows some AVX1 shifts of 256-bit values to stay as a YMM instruction. Noticed in codegen from #82290	2024-03-02 12:44:33 +00:00
Noah Goldstein	ae76dfb747	[X86] Don't always separate conditions in `(br (and/or cond0, cond1))` into separate branches It makes sense to split if the cost of computing `cond1` is high (proportionally to how likely `cond0` is), but it doesn't really make sense to introduce a second branch if it's only a few instructions. Splitting can also get in the way of potentially folding patterns. This patch introduces some logic to try to check if the cost of computing `cond1` is relatively low, and if so don't split the branches. Modest improvement on clang bootstrap build: https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles Average stage2-O3: 0.59% Improvement (cycles) Average stage2-O0-g: 1.20% Improvement (cycles) Likewise on llvm-test-suite on SKX saw a net 0.84% improvement (cycles) There is also a modest compile time improvement with this patch: https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=instructions%3Au Note that the stage2 instruction count increase is expected, this patch trades instructions for decreasing branch-misses (which is proportionately lower): https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=branch-misses NB: This will also likely help for APX targets with the new `CCMP` and `CTEST` instructions. Closes #81689	2024-03-01 15:35:34 -06:00
Simon Pilgrim	765a5d62bc	[X86] Pre-SSE42 v2i64 sgt lowering - check if representable as v2i32 (#83560 ) Without PCMPGTQ, if the i64 elements are sign-extended enough to be representable as i32 then we can compare the lower i32 bits with PCMPGTD and splat the results into the upper elements. Value tracking has meant we already get pretty close with this, but this allows us to remove a lot of unnecessary bit flipping.	2024-03-01 14:29:12 +00:00
Simon Pilgrim	80a328b011	[X86] SimplifyDemandedVectorEltsForTargetNode - add basic PCMPEQ/PCMPGT handling	2024-02-29 15:22:12 +00:00
Simon Pilgrim	139bcda542	[X86] SimplifyDemandedVectorEltsForTargetNode - add basic CVTPH2PS/CVTPS2PH handling Allows us to peek through the F16 conversion nodes, mainly to simplify shuffles An easy part of #83414	2024-02-29 12:33:49 +00:00
Simon Pilgrim	7ff3f9760d	[X86] getFauxShuffleMask - handle insert_vector_elt(bitcast(extract_vector_elt(x))) shuffle patterns If the bitcast is between types of equal scalar size (i.e. fp<->int bitcasts), then we can safely peek through them Fixes #83289	2024-02-29 10:32:49 +00:00
Simon Pilgrim	6287b7b9e9	[X86] combineEXTRACT_SUBVECTOR - extract 256-bit comparisons if only one subvector is required If only one subvector extraction will be necessary (i.e. because the other is constant etc.) then extract the source operands and perform as a 128-bit comparison. Ideally DAGCombiner's narrowExtractedVectorBinOp would handle this but it's tricky to confirm when a target opcode can be safely extracted and performed as a different vector type. Partially improves an outstanding regression in #82290	2024-02-28 12:24:34 +00:00
Simon Pilgrim	c95febcb40	[X86] LowerBITREVERSE - add handling for all legal 128/256/512-bit vector types, not just vXi8 Move the BITREVERSE(BSWAP(X)) expansion into LowerBITREVERSE to help simplify #81764	2024-02-27 17:46:30 +00:00
Simon Pilgrim	13c359aa9b	[X86] ReplaceNodeResults - truncate sub-128-bit vectors as shuffles directly (#83120 ) We were scalarizing these truncations, but in most cases we can widen the source vector to 128-bits and perform the truncation as a shuffle directly (which will usually lower as a PACK or PSHUFB). For the cases where the widening and shuffle isn't legal we can leave it to generic legalization to scalarize for us. Fixes #81883	2024-02-27 15:03:42 +00:00
Simon Pilgrim	b8c9b06134	[X86] LowerCTPOP - add i3 and i4 LUT 'shift+mask' expansions Use the 3 or 4 active bits as a shift amount into a i32/i64 constant representing the number of set bits. In future, it might be worthwhile to move this into a generic location in case other targets want to make use of them. Another expansion pulled from #79823	2024-02-21 13:53:47 +00:00
Simon Pilgrim	98a07f72ee	[X86] LowerCTPOP - "ctpop(i2 x) --> sub(x, (x >> 1))" If we only have 2 active bits then we can avoid the i8 CTPOP multiply expansion entirely Another expansion pulled from #79823	2024-02-21 13:53:47 +00:00
Simon Pilgrim	066773c411	[X86] computeKnownBitsForTargetNode - add generic handling of PSHUFB When PSHUFB is used as a LUT (for CTPOP, BITREVERSE etc.), it's the source operand that is constant and the index operand the variable. As long as the indices don't set the MSB (which zeros the output element), then the common known bits from the source operand can be used directly, even though the shuffle mask isn't constant. Further helps to improve CTPOP reduction codegen	2024-02-20 17:14:49 +00:00
Simon Pilgrim	2f1e33df32	[X86] Fold add(psadbw(X,0),psadbw(Y,0)) -> psadbw(add(X,Y),0) If the vXi8 add(X,Y) is guaranteed not to overflow then we can push the addition through the psadbw nodes (being used for reduction) and only need a single psadbw node. Noticed while working on CTPOP reduction codegen	2024-02-20 15:58:29 +00:00
Simon Pilgrim	539febfe30	[X86] combineEXTRACT_SUBVECTOR - share the same SDLoc argument instead of recreating it over and over again.	2024-02-20 15:58:29 +00:00
XinWang10	bb91b43719	[X86] Handle repeated blend mask in combineConcatVectorOps (#82155 ) https://github.com/llvm/llvm-project/commit/1d27669e8ad07f8f2 added support for folding 512-bit concat(blendi(x,y,c0),blendi(z,w,c1)) to an AVX512BW mask select, but when the subvector type is v16i16 we need to generate a repeated mask to make the result correct. The subnode looks like t87: v16i16 = X86ISD::BLENDI t132, t58, TargetConstant:i8<-86>.	2024-02-19 09:24:21 +08:00
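
The two LowerCTPOP entries above (98a07f72ee and b8c9b06134) describe small arithmetic identities that are easy to check in isolation. The following is a minimal scalar sketch of those identities, not the SelectionDAG code from the commits: the function names are hypothetical, and the lookup constants were worked out by hand for this sketch rather than copied from the patches. Each function assumes its input has no bits set above the stated width.

```cpp
#include <cassert>
#include <cstdint>

// Reference popcount used only to check the expansions below.
inline uint32_t naivePopcount(uint32_t x) {
  uint32_t n = 0;
  for (; x != 0; x >>= 1)
    n += x & 1u;
  return n;
}

// ctpop(i2 x) --> sub(x, (x >> 1)). Assumes x < 4.
inline uint32_t ctpop2(uint32_t x) {
  return x - (x >> 1);
}

// ctpop(i3 x) as a shift+mask lookup. Assumes x < 8.
// Entry x occupies bits [2x+1 : 2x] of the constant (hand-derived for this sketch).
inline uint32_t ctpop3(uint32_t x) {
  const uint32_t lut = 0xE994u; // 0b11'10'10'01'10'01'01'00
  return (lut >> (x * 2)) & 0x3u;
}

// ctpop(i4 x) as a shift+mask lookup. Assumes x < 16.
// Entry x occupies bits [4x+3 : 4x] of the constant (hand-derived for this sketch).
inline uint32_t ctpop4(uint32_t x) {
  const uint64_t lut = 0x4332322132212110ULL;
  return static_cast<uint32_t>((lut >> (x * 4)) & 0x7u);
}
```

The lookup variants use the value itself as a scaled shift amount into a constant, so no multiply or per-bit loop is needed. Comparing each function against naivePopcount over its full input range is enough to validate the constants.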

8779 Commits