8779 Commits

Author SHA1 Message Date
shamithoke
e3ef4612c1
Perform bitreverse using AVX512 GFNI for i32 and i64. (#81764)
Currently, the lowering operation for bitreverse using Intel AVX512 GFNI only supports byte vectors.

Extend the operation to i32 and i64.

---------

Co-authored-by: shami <shami_thoke@yahoo.com>
2024-04-10 20:22:44 +01:00
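A minimal Python sketch of why a byte-level bit reversal is enough for wider scalars such as those in e3ef4612c1: reversing the bits inside each byte and then swapping the byte order reverses the whole i32/i64. The helper names are hypothetical, and the per-byte step only stands in for whatever a byte-wise GFNI transform would compute; this is not the LLVM lowering itself.

```python
def bitreverse8(b: int) -> int:
    """Reverse the 8 bits of a single byte."""
    r = 0
    for i in range(8):
        r |= ((b >> i) & 1) << (7 - i)
    return r


def bitreverse_wide(x: int, bits: int) -> int:
    """Reverse a `bits`-wide integer: per-byte bit reverse, then byte swap."""
    assert bits % 8 == 0
    n = bits // 8
    data = (x & ((1 << bits) - 1)).to_bytes(n, "little")
    per_byte = bytes(bitreverse8(b) for b in data)
    # Re-reading the little-endian bytes as big-endian performs the byte swap.
    return int.from_bytes(per_byte, "big")


def bitreverse_reference(x: int, bits: int) -> int:
    """Straightforward reference: reverse the binary string."""
    return int(format(x & ((1 << bits) - 1), f"0{bits}b")[::-1], 2)
```

For example, `bitreverse_wide(1, 32)` moves the lowest bit to the top bit of the 32-bit value.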
Noah Goldstein
6c40d463c2 [X86] Use nneg flag when trying to convert uitofp -> sitofp
Closes #86694
2024-04-09 23:06:55 -05:00
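As a rough illustration of why a known non-negative operand makes the uitofp -> sitofp conversion in 6c40d463c2 safe: when the sign bit is clear, the unsigned and signed readings of the bit pattern are the same integer, so both conversions produce the same floating-point value. The Python helpers below are hypothetical models of the two IR operations, not LLVM code.

```python
def uitofp(bits: int, width: int) -> float:
    """Convert a `width`-bit pattern to float, reading it as unsigned."""
    return float(bits & ((1 << width) - 1))


def sitofp(bits: int, width: int) -> float:
    """Convert a `width`-bit pattern to float, reading it as two's complement."""
    bits &= (1 << width) - 1
    if bits >> (width - 1):
        bits -= 1 << width
    return float(bits)


def is_nneg(bits: int, width: int) -> bool:
    """True if the sign bit of the `width`-bit pattern is clear."""
    return (bits >> (width - 1)) & 1 == 0
```

The two conversions only disagree for patterns with the sign bit set, which is exactly the case the nneg flag rules out.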
Simon Pilgrim
4023329bbf [X86] collectConcatOps - add ability to recurse through insert_subvector chains
Allows us to match insert_subvector(insert_subvector(undef, insert_subvector(insert_subvector(undef, x, 0), y, 1), 0), 0),
                                    insert_subvector(insert_subvector(undef, z, 0), w, 1), 2)
2024-04-09 13:23:44 +01:00
Simon Pilgrim
0bbe953aa3 [X86] Fold extract_subvector(cvtps2dq(x),c) -> cvtps2dq(extract_subvector(x,c))
Help unblock #83402
2024-04-09 11:06:18 +01:00
Simon Pilgrim
170c525d79 [X86] combineExtractVectorElt - fold extract(trunc(x),c) -> trunc(extract(x,c)) 2024-04-08 11:01:19 +01:00
Simon Pilgrim
2d0087424f
[DAG] Remove extract_vector_elt(freeze(x)), idx -> freeze(extract_vector_elt(x), idx) fold (#87480)
Reverse the fold with handling inside canCreateUndefOrPoison for cases where we know that the extract index is in bounds.

This exposed a number of regressions, and required some initial freeze handling of SCALAR_TO_VECTOR, which will require us to properly improve demandedelts support to handle its undef upper elements.

There is still one outstanding regression to be addressed in the future - how do we want to handle folds involving frozen loads?

Fixes #86968
2024-04-04 11:10:55 +01:00
Simon Pilgrim
8bc2d19c13 [X86] canonicalizeShuffleWithOp - don't fold VPERMI(BINOP(X,Y)) -> BINOP(VPERMI(X),VPERMI(Y))
VPERMI (VPERMQ/PD) is nearly always lane-crossing and poorly merges with target shuffles (other than itself).

For now, I've restricted VPERMI to only merge with itself, constants, loads and splats.

We might be able to merge with a few other special cases (AND/ANDNP with constant?), which could help the shuffle-vs-trunc-256.ll AVX512VL regression, but since that now gives similar codegen to the other AVX512 variants, I'd prefer to improve the shuffle lowering for that properly.
2024-04-02 18:38:37 +01:00
Simon Pilgrim
5b06de7f99 [X86] Add isLogicOp helper to match ISD::AND/OR/XOR and X86ISD::ANDNP
We could easily support the X86ISD 'float' variants of the logic ops as well, but we don't have good test coverage at the moment (they're mainly for SSE1 targets).
2024-03-28 19:39:17 +00:00
Simon Pilgrim
dcd0f2b610 [X86] combineExtractFromVectorLoad support extraction from vector of different types to the extraction type/index
combineExtractFromVectorLoad no longer uses the vector we're extracting from to determine the pointer offset calculation, allowing us to extract from types that have been bitcast to work with specific target shuffles.

Fixes #85419
2024-03-27 17:01:41 +00:00
Simon Pilgrim
6d3ec56d3c [X86] combineExtractWithShuffle - use combineExtractFromVectorLoad to extract scalar load from shuffled vector load
Improves #85419
2024-03-27 14:54:25 +00:00
Simon Pilgrim
875aed17b9 [X86] Add combineExtractFromVectorLoad helper - pulled out of combineExtractVectorElt
Prep work for #85419 to make it easier to reuse in other combines
2024-03-27 12:22:31 +00:00
Björn Pettersson
3e6e54eb79
[X86] Fix miscompile in combineShiftRightArithmetic (#86597)
When folding (ashr (shl, x, c1), c2) we need to treat c1 and c2
as unsigned to find out if the combined shift should be a left
or right shift.
Also do an early out during pre-legalization in case c1 and c2
have different types, as that would otherwise complicate the
comparison of c1 and c2 a bit.
2024-03-26 20:53:34 +01:00
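The shift-direction decision described in 3e6e54eb79 can be sketched in Python for one assumed case: i32 values with c1 = 24, so the shl keeps exactly the low byte and the pair behaves like a sign extension from i8 followed by a single shift. The function names are hypothetical; the shift amounts are plain non-negative Python integers, which models comparing them as unsigned.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def shl32(x: int, c: int) -> int:
    return (x << c) & MASK32


def ashr32(x: int, c: int) -> int:
    # Python's >> on a negative int is an arithmetic shift.
    return (to_signed32(x) >> c) & MASK32


def sext_inreg8(x: int) -> int:
    """Sign-extend the low byte of x to 32 bits."""
    low = x & 0xFF
    return (low - 0x100 if low & 0x80 else low) & MASK32


def folded_shift(x: int, c1: int, c2: int) -> int:
    """Model of (ashr (shl x, c1), c2) for the assumed case c1 == 24."""
    s = sext_inreg8(x)
    if c1 >= c2:  # unsigned comparison picks the direction
        return shl32(s, c1 - c2)
    return ashr32(s, c2 - c1)
```

If c1 and c2 were compared as signed values of mismatched widths, the wrong branch could be taken, which is the kind of miscompile the commit message describes.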
Simon Pilgrim
d18bee2313 [X86] combineConcatVectorOps - concatenate FADD/FSUB/FMUL ops if we don't increase the number of INSERT_SUBVECTOR nodes.
FADD/FSUB/FMUL are usually less port-bound than INSERT_SUBVECTOR, so only concatenate if it reduces the instruction count and doesn't introduce extra INSERT_SUBVECTOR nodes.
2024-03-26 15:03:41 +00:00
Phoebe Wang
2e4e04c590
[X86][BF16] Do not lower to VCVTNEPS2BF16 without AVX512VL (#86395)
Fixes: #86305
2024-03-25 10:06:12 +08:00
Simon Pilgrim
7ccb31a5bc [X86] splitVectorOp - share the same SDLoc argument instead of recreating it over and over again. 2024-03-20 11:21:09 +00:00
Jonas Paulsson
09bc6abba6
[MachineFrameInfo] Refactoring around computeMaxCallFrameSize() (NFC) (#78001)
- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.

- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
2024-03-18 10:37:59 -04:00
SahilPatidar
186565513c
[X86][AVX] Fix handling of out-of-bounds SRA shift amounts in AVX2 vector shift nodes (#84426) 2024-03-15 16:34:33 +00:00
Simon Pilgrim
c957715d72 [X86] isGuaranteedNotToBeUndefOrPoisonForTargetNode - generalize shuffle decoding to support more target shuffles in the future. 2024-03-15 14:13:14 +00:00
Phoebe Wang
f4676b6be6
[X86] Add Support for X86 TLSDESC Relocations (#83136) 2024-03-15 22:09:56 +08:00
Simon Pilgrim
2e271ceff6 Revert 4fef8c75abb080e6471395492819171fee8261fa "[X86] splitVectorOp - share the same SDLoc argument instead of recreating it over and over again."
This appears to have broken the clang-with-thin-lto-ubuntu buildbot somehow (unconfirmed but it's a likely candidate).
2024-03-14 12:54:58 +00:00
Simon Pilgrim
4fef8c75ab [X86] splitVectorOp - share the same SDLoc argument instead of recreating it over and over again. 2024-03-14 10:47:26 +00:00
Simon Pilgrim
c1af6ab505 [X86] getFauxShuffleMask - recognise CONCAT(SUB0, SUB1) style patterns
Handles the INSERT_SUBVECTOR(INSERT_SUBVECTOR(UNDEF,SUB0,0),SUB1,N) pattern

Currently limited to v8i64/v8f64 cases as only AVX512 has decent cross lane 2-input shuffles; the plan is to relax this as I deal with some regressions.
2024-03-12 17:40:19 +00:00
Simon Pilgrim
683a9ac803 [X86] combineVectorPack - use APInt::truncSSat for PACKSS constant folding. NFC.
Unfortunately PACKUS can't use APInt::truncUSat
2024-03-12 17:10:27 +00:00
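A small Python model of the saturation difference behind 683a9ac803, assuming the usual instruction semantics: PACKSS clamps a signed source into the signed destination range, which is what a signed saturating truncation does, whereas PACKUS clamps a signed source into the unsigned destination range. An unsigned saturating truncation treats the source as unsigned, so it gives a different answer for negative inputs. All function names here are hypothetical.

```python
def to_signed(x: int, bits: int) -> int:
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def trunc_ssat(x: int, src_bits: int, dst_bits: int) -> int:
    """Signed source clamped to the signed destination range (PACKSS-like)."""
    v = to_signed(x, src_bits)
    lo, hi = -(1 << (dst_bits - 1)), (1 << (dst_bits - 1)) - 1
    return max(lo, min(hi, v)) & ((1 << dst_bits) - 1)


def trunc_usat(x: int, src_bits: int, dst_bits: int) -> int:
    """Unsigned source clamped to the unsigned destination range."""
    v = x & ((1 << src_bits) - 1)
    return min(v, (1 << dst_bits) - 1)


def packus_element(x: int, src_bits: int, dst_bits: int) -> int:
    """Signed source clamped to the unsigned destination range (PACKUS-like)."""
    v = to_signed(x, src_bits)
    return max(0, min((1 << dst_bits) - 1, v))
```

For the i16 pattern 0xFFFF (that is, -1), `packus_element` clamps to 0 while `trunc_usat` clamps to 0xFF, so the two are not interchangeable.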
Simon Pilgrim
ffe41819e5
[Support] Add KnownBits::abds signed absolute difference and rename absdiff -> abdu (#84897)
When I created KnownBits::absdiff, I totally missed that we already have ISD::ABDS/ABDU nodes, and we use this term in other places/targets as well.

I've added the KnownBits::abds implementation and renamed KnownBits::absdiff to KnownBits::abdu.

Followup to #84791
2024-03-12 12:10:30 +00:00
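For readers unfamiliar with the naming adopted in ffe41819e5, here is a hedged Python sketch of the two integer operations the names refer to: abdu is the absolute difference of the operands read as unsigned, and abds is the absolute difference of the operands read as signed. This models the arithmetic on concrete values only; it does not model the KnownBits analysis, and the function names are illustrative.

```python
def to_signed(x: int, bits: int) -> int:
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def abdu(a: int, b: int, bits: int) -> int:
    """Unsigned absolute difference of two `bits`-wide patterns."""
    m = (1 << bits) - 1
    a &= m
    b &= m
    return a - b if a >= b else b - a


def abds(a: int, b: int, bits: int) -> int:
    """Signed absolute difference, returned as a `bits`-wide pattern."""
    m = (1 << bits) - 1
    return abs(to_signed(a, bits) - to_signed(b, bits)) & m
```

The same bit patterns can give different results: for i8, 0xFF and 0x01 differ by 0xFE when read as unsigned but only by 2 when read as signed.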
Simon Pilgrim
7b90a67fe7 [X86] Assert that the supportedVectorShift* helpers are only called with generic shift opcodes. NFC. 2024-03-11 11:21:48 +00:00
Simon Pilgrim
862c7e0218 [X86] combineAndShuffleNot - ensure the type is legal before creating X86ISD::ANDNP target nodes
Fixes #84660
2024-03-10 16:23:51 +00:00
Noah Goldstein
9f96db8e31 [X86] Fold (icmp ult (add x,-C),2) -> (or (icmp eq X,C), (icmp eq X,C+1)) for Vectors
This is undoing a middle-end transform which does the opposite. Since
X86 doesn't have unsigned vector comparison instructions pre-AVX512,
the simplified form gets worse codegen.

Fixes #66479

Proofs: https://alive2.llvm.org/ce/z/UCz3wt

Closes #84104
Closes #66479
2024-03-07 13:12:09 -06:00
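The equivalence used by 9f96db8e31 can be checked exhaustively on a small width. The Python sketch below uses 8-bit elements as an assumption to keep the search small; the function names are hypothetical and it models a single vector element.

```python
def range_check(x: int, c: int, bits: int = 8) -> bool:
    """(icmp ult (add x, -C), 2) on a `bits`-wide element."""
    m = (1 << bits) - 1
    return ((x - c) & m) < 2


def two_equalities(x: int, c: int, bits: int = 8) -> bool:
    """(or (icmp eq x, C), (icmp eq x, C+1)) with wrapping C+1."""
    m = (1 << bits) - 1
    x &= m
    return x == (c & m) or x == ((c + 1) & m)
```

The second form needs only equality comparisons, which avoids the unsigned vector comparison that the commit message says is unavailable before AVX512.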
Simon Pilgrim
0bd9255f8a
[X86] Improve KnownBits for X86ISD::PSADBW nodes (#83830)
Don't just return the known zero upper bits; compute the absdiff KnownBits and perform the horizontal sum.

Add implementations that handle both the X86ISD::PSADBW nodes and the INTRINSIC_WO_CHAIN intrinsics (pre-legalization).
2024-03-06 17:23:15 +00:00
Simon Pilgrim
5a896c66e3 [X86] Merge repeated getTargetLoweringInfo() calls. NFC. 2024-03-06 15:47:35 +00:00
Simon Pilgrim
f53c2f66a7 [X86] combineSetCC - use getZExtOrTrunc() to perform the constant folding. NFCI 2024-03-06 15:47:34 +00:00
Noah Goldstein
a4951eca40 Recommit "[X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches" (2nd Try)
Changes in Recommit:
    1) Fix non-determinism by using `SmallMapVector` instead of
       `SmallPtrSet`.
    2) Fix bug in dependency pruning where we discounted the actual
       `and/or` combining the two conditions. This led to over-pruning.

Closes #81689
2024-03-04 13:23:56 -06:00
Shilei Tian
8300f30a92 [SelectionDAG] Add STRICT_BF16_TO_FP and STRICT_FP_TO_BF16 (#80056)
This patch adds the support for `STRICT_BF16_TO_FP` and
`STRICT_FP_TO_BF16`.
2024-03-04 01:08:49 -05:00
Shilei Tian
2c5d01c2cf Revert "[SelectionDAG] Add STRICT_BF16_TO_FP and STRICT_FP_TO_BF16 (#80056)"
This reverts commit b0c158bd947c360a4652eb0de3a4794f46deb88b.

The changes in `compiler-rt` broke tests.
2024-03-04 00:33:31 -05:00
Shilei Tian
b0c158bd94
[SelectionDAG] Add STRICT_BF16_TO_FP and STRICT_FP_TO_BF16 (#80056)
This patch adds the support for `STRICT_BF16_TO_FP` and
`STRICT_FP_TO_BF16`.
2024-03-04 00:01:50 -05:00
NAKAMURA Takumi
5b4759f9fd Revert "[X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches"
This has been buggy for a while.

Reverts #81689
This reverts commit ae76dfb74701e05e5ab4be194e20e49f10768e46.
2024-03-03 22:31:28 +09:00
Simon Pilgrim
ca827d53c5
[X86] Convert logicalshift(x, C) -> and(x, M) iff x is allsignbits (#83596)
If we're logical shifting an all-signbits value, then we can just mask out the shifted bits.

This helps remove some unnecessary bitcasted vXi16 shifts used for vXi8 shifts (which SimplifyDemandedBits will struggle to remove through the bitcast), and allows some AVX1 shifts of 256-bit values to stay as a YMM instruction.

Noticed in codegen from #82290
2024-03-02 12:44:33 +00:00
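A short Python check of the identity behind ca827d53c5: if every bit of x equals its sign bit (x is either zero or all-ones), a logical shift by C gives the same result as ANDing x with the all-ones value shifted by C. The 8-bit element width and the function names are assumptions made for illustration.

```python
BITS = 8
ALL_ONES = (1 << BITS) - 1


def lshr(x: int, c: int) -> int:
    return (x & ALL_ONES) >> c


def shl(x: int, c: int) -> int:
    return (x << c) & ALL_ONES


def lshr_as_and(x: int, c: int) -> int:
    """Valid only when x is all-signbits (0 or ALL_ONES)."""
    return x & (ALL_ONES >> c)


def shl_as_and(x: int, c: int) -> int:
    """Valid only when x is all-signbits (0 or ALL_ONES)."""
    return x & ((ALL_ONES << c) & ALL_ONES)
```

The mask M is a constant computed from C alone, so the shift disappears entirely from the emitted code.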
Noah Goldstein
ae76dfb747 [X86] Don't always separate conditions in (br (and/or cond0, cond1)) into separate branches
It makes sense to split if the cost of computing `cond1` is high
(proportionally to how likely `cond0` is), but it doesn't really make
sense to introduce a second branch if it's only a few instructions.

Splitting can also get in the way of potentially folding patterns.

This patch introduces some logic to try to check if the cost of
computing `cond1` is relatively low, and if so don't split the
branches.

Modest improvement on clang bootstrap build:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=cycles
Average stage2-O3:   0.59% Improvement (cycles)
Average stage2-O0-g: 1.20% Improvement (cycles)

Likewise, llvm-test-suite on SKX saw a net 0.84% improvement (cycles).

There is also a modest compile time improvement with this patch:
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=instructions%3Au

Note that the stage2 instruction count increase is expected; this
patch trades instructions for decreasing branch-misses (which is
proportionately lower):
https://llvm-compile-time-tracker.com/compare.php?from=79ce933114e46c891a5632f7ad4a004b93a5b808&to=978278eabc0bafe2f390ca8fcdad24154f954020&stat=branch-misses

NB: This will also likely help for APX targets with the new `CCMP` and
`CTEST` instructions.

Closes #81689
2024-03-01 15:35:34 -06:00
Simon Pilgrim
765a5d62bc
[X86] Pre-SSE42 v2i64 sgt lowering - check if representable as v2i32 (#83560)
Without PCMPGTQ, if the i64 elements are sign-extended enough to be representable as i32 then we can compare the lower i32 bits with PCMPGTD and splat the results into the upper elements.

Value tracking has meant we already get pretty close with this, but this allows us to remove a lot of unnecessary bit flipping.
2024-03-01 14:29:12 +00:00
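The idea in 765a5d62bc can be sketched in Python: when both i64 elements are sign-extended from i32, a signed compare of just the low 32 bits gives the same answer, and the 32-bit result mask can be copied into the upper half of each lane. Vectors are modelled as two-element lists of 64-bit patterns, and every name below is hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed(x: int, bits: int) -> int:
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def fits_i32(x: int) -> bool:
    """True if the 64-bit pattern is a sign-extended 32-bit value."""
    return -(1 << 31) <= to_signed(x, 64) < (1 << 31)


def sgt_v2i64(a, b):
    """Reference: full 64-bit signed greater-than, all-ones mask per lane."""
    return [MASK64 if to_signed(x, 64) > to_signed(y, 64) else 0
            for x, y in zip(a, b)]


def sgt_v2i64_via_i32(a, b):
    """Compare only the low 32 bits, then splat the result into the upper half."""
    out = []
    for x, y in zip(a, b):
        assert fits_i32(x) and fits_i32(y)
        lo_gt = to_signed(x & MASK32, 32) > to_signed(y & MASK32, 32)
        lane32 = MASK32 if lo_gt else 0
        out.append(lane32 | (lane32 << 32))
    return out
```

The precondition matters: without it the upper 32 bits could decide the comparison and the low-half compare would be wrong.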
Simon Pilgrim
80a328b011 [X86] SimplifyDemandedVectorEltsForTargetNode - add basic PCMPEQ/PCMPGT handling 2024-02-29 15:22:12 +00:00
Simon Pilgrim
139bcda542 [X86] SimplifyDemandedVectorEltsForTargetNode - add basic CVTPH2PS/CVTPS2PH handling
Allows us to peek through the F16 conversion nodes, mainly to simplify shuffles

An easy part of #83414
2024-02-29 12:33:49 +00:00
Simon Pilgrim
7ff3f9760d [X86] getFauxShuffleMask - handle insert_vector_elt(bitcast(extract_vector_elt(x))) shuffle patterns
If the bitcast is between types of equal scalar size (i.e. fp<->int bitcasts), then we can safely peek through them

Fixes #83289
2024-02-29 10:32:49 +00:00
Simon Pilgrim
6287b7b9e9 [X86] combineEXTRACT_SUBVECTOR - extract 256-bit comparisons if only one subvector is required
If only one subvector extraction will be necessary (i.e. because the other is constant etc.) then extract the source operands and perform as a 128-bit comparison

Ideally DAGCombiner's narrowExtractedVectorBinOp would handle this but it's tricky to confirm when a target opcode can be safely extracted and performed as a different vector type

Partially improves an outstanding regression in #82290
2024-02-28 12:24:34 +00:00
Simon Pilgrim
c95febcb40 [X86] LowerBITREVERSE - add handling for all legal 128/256/512-bit vector types, not just vXi8
Move the BITREVERSE(BSWAP(X)) expansion into LowerBITREVERSE to help simplify #81764
2024-02-27 17:46:30 +00:00
Simon Pilgrim
13c359aa9b
[X86] ReplaceNodeResults - truncate sub-128-bit vectors as shuffles directly (#83120)
We were scalarizing these truncations, but in most cases we can widen the source vector to 128 bits and perform the truncation as a shuffle directly (which will usually lower as a PACK or PSHUFB).

For the cases where the widening and shuffle isn't legal we can leave it to generic legalization to scalarize for us.

Fixes #81883
2024-02-27 15:03:42 +00:00
Simon Pilgrim
b8c9b06134 [X86] LowerCTPOP - add i3 and i4 LUT 'shift+mask' expansions
Use the 3 or 4 active bits as a shift amount into a i32/i64 constant representing the number of set bits.

In future, it might be worthwhile to move this into a generic location in case other targets want to make use of them.

Another expansion pulled from #79823
2024-02-21 13:53:47 +00:00
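The 'shift+mask' lookup from b8c9b06134 can be sketched in Python as follows. The field widths are assumptions chosen so the tables fit the sizes mentioned in the message: 2-bit fields for i3 (8 entries, 16 bits, fits an i32) and 4-bit fields for i4 (16 entries, 64 bits, fits an i64). The helper names are hypothetical and the constants are built at runtime rather than copied from LLVM.

```python
def build_lut(active_bits: int, field_bits: int) -> int:
    """Pack popcount(v) for every v < 2**active_bits into one integer."""
    lut = 0
    for v in range(1 << active_bits):
        lut |= bin(v).count("1") << (v * field_bits)
    return lut


def ctpop_lut(x: int, active_bits: int, field_bits: int) -> int:
    """Use x (scaled by the field width) as a shift amount into the table."""
    x &= (1 << active_bits) - 1
    lut = build_lut(active_bits, field_bits)
    return (lut >> (x * field_bits)) & ((1 << field_bits) - 1)
```

In a real lowering the table would be a compile-time constant, so the whole population count reduces to one shift and one AND.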
Simon Pilgrim
98a07f72ee [X86] LowerCTPOP - "ctpop(i2 x) --> sub(x, (x >> 1))"
If we only have 2 active bits then we can avoid the i8 CTPOP multiply expansion entirely

Another expansion pulled from #79823
2024-02-21 13:53:47 +00:00
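The two-bit identity from 98a07f72ee is small enough to verify over all inputs. A minimal Python sketch, with a hypothetical function name:

```python
def ctpop_i2(x: int) -> int:
    """Population count of a value with only 2 active bits: x - (x >> 1)."""
    x &= 0b11
    return x - (x >> 1)
```

Writing x as 2*hi + lo, the subtraction gives 2*hi + lo - hi = hi + lo, which is the number of set bits.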
Simon Pilgrim
066773c411 [X86] computeKnownBitsForTargetNode - add generic handling of PSHUFB
When PSHUFB is used as a LUT (for CTPOP, BITREVERSE etc.), it's the source operand that is constant and the index operand the variable. As long as the indices don't set the MSB (which zeros the output element), then the common known bits from the source operand can be used directly, even though the shuffle mask isn't constant.

Further helps to improve CTPOP reduction codegen
2024-02-20 17:14:49 +00:00
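A hedged Python model of the reasoning in 066773c411, restricted to a single 16-byte lane: if the table operand is constant, any bit that is zero in every table byte is zero in every output byte, and likewise for ones, provided no index has its MSB set. The nibble popcount table is used as an example input, and the function names are hypothetical.

```python
CTPOP_NIBBLE_LUT = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]


def pshufb_lane(table, indices):
    """One 16-byte lane of PSHUFB: an index with its MSB set zeroes the byte."""
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]


def common_known_bits(table):
    """Bits that are zero (or one) in every byte of the table."""
    known_zero = 0xFF
    known_one = 0xFF
    for b in table:
        known_zero &= ~b & 0xFF
        known_one &= b
    return known_zero, known_one
```

For the popcount table every entry is at most 4, so the upper five bits of each output byte are known zero regardless of which index is chosen.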
Simon Pilgrim
2f1e33df32 [X86] Fold add(psadbw(X,0),psadbw(Y,0)) -> psadbw(add(X,Y),0)
If the vXi8 add(X,Y) is guaranteed not to overflow then we can push the addition through the psadbw nodes (being used for reduction) and only need a single psadbw node.

Noticed while working on CTPOP reduction codegen
2024-02-20 15:58:29 +00:00
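The fold in 2f1e33df32 relies on psadbw against zero acting as a horizontal byte sum. The Python sketch below models one 8-byte group as a list; the helper names are hypothetical.

```python
def psadbw_group(x, y):
    """Sum of absolute differences over one group of 8 unsigned bytes."""
    assert len(x) == len(y) == 8
    return sum(abs(a - b) for a, b in zip(x, y))


def add_v8i8(x, y):
    """Wrapping byte-wise addition."""
    return [(a + b) & 0xFF for a, b in zip(x, y)]


def add_overflows(x, y):
    """True if any byte-wise sum would wrap."""
    return any(a + b > 0xFF for a, b in zip(x, y))


ZERO = [0] * 8
```

When no byte-wise sum wraps, summing two reductions equals reducing the sum; once a byte wraps the equality breaks, which is why the fold needs the no-overflow guarantee.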
Simon Pilgrim
539febfe30 [X86] combineEXTRACT_SUBVECTOR - share the same SDLoc argument instead of recreating it over and over again. 2024-02-20 15:58:29 +00:00
XinWang10
bb91b43719
[X86] Handle repeated blend mask in combineConcatVectorOps (#82155)
https://github.com/llvm/llvm-project/commit/1d27669e8ad07f8f2 added
support for folding 512-bit concat(blendi(x,y,c0),blendi(z,w,c1)) to an
AVX512BW mask select.
But when the subvector type is v16i16, we need to generate a repeated
mask to make the result correct.
The subnode looks like t87: v16i16 = X86ISD::BLENDI t132, t58,
TargetConstant:i8<-86>.
2024-02-19 09:24:21 +08:00
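A Python sketch of the mask repetition that bb91b43719 describes, under the assumption that a 16-bit-element blend immediate covers only the 8 elements of one 128-bit lane and is reused for the second lane of a v16i16. The function names are hypothetical, and the concatenation helper only illustrates how two per-subvector masks could form one per-element select mask.

```python
def repeat_blend_mask_v16i16(imm8: int) -> int:
    """Expand an 8-bit blend immediate to a 16-bit per-element mask."""
    m = imm8 & 0xFF  # the i8 constant -86 is the bit pattern 0xAA
    return m | (m << 8)


def concat_select_mask(c0: int, c1: int) -> int:
    """Per-element mask for concat(blendi(x,y,c0), blendi(z,w,c1)) as v32i16."""
    return repeat_blend_mask_v16i16(c0) | (repeat_blend_mask_v16i16(c1) << 16)
```

With the immediate from the message, `repeat_blend_mask_v16i16(-86)` yields 0xAAAA; using the raw 8-bit value would leave the upper eight elements unselected.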