Currently, the lowering operation for bitreverse using Intel AVX512 GFNI only supports byte vectors
Extend the operation to i32 and i64.
---------
Co-authored-by: shami <shami_thoke@yahoo.com>
Allows us to match insert_subvector(insert_subvector(undef, insert_subvector(insert_subvector(undef, x, 0), y, 1), 0), 0),
insert_subvector(insert_subvector(undef, z, 0), w, 1), 2)
Reverse the fold with handling inside canCreateUndefOrPoison for cases where we know that the extract index is in bounds.
This exposed a number or regressions, and required some initial freeze handling of SCALAR_TO_VECTOR, which will require us to properly improve demandedelts support to handle its undef upper elements.
There is still one outstanding regression to be addressed in the future - how do we want to handle folds involving frozen loads?
Fixes#86968
VPERMI (VPERMQ/PD) is nearly always lane-crossing and poorly merges with target shuffles (other than itself).
For now, I've restricted VPERMI to only merge with itself, constants, loads and splats.
We might be able to merge with a few other special cases (AND/ANDNP with constant?), which could help the shuffle-vs-trunc-256.ll AVX512VL regression, but since that now gives similar codegen to the other AVX512 variants, I'd prefer to improve the shuffle lowering for that properly.
We could easily support the X86ISD 'float' variants of the logic ops as well, but we don't have good test coverage at the moment (they're mainly for SSE1 targets).
combineExtractFromVectorLoad no longer uses the vector we're extracting from to determine the pointer offset calculation, allowing us to extract from types that have been bitcast to work with specific target shuffles.
Fixes#85419
When folding (ashr (shl, x, c1), c2) we need to treat c1 and c2
as unsigned to find out if the combined shift should be a left
or right shift.
Also do an early out during pre-legalization in case c1 and c2
has differet types, as that otherwise complicated the comparison
of c1 and c2 a bit.
FADD/FSUB/FMUL are usually less port-bound than INSERT_SUBVECTOR, so only concatenate if it reduces the instruction count and doesn't introduce extra INSERT_SUBVECTOR nodes.
- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.
- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
Handles the INSERT_SUBVECTOR(INSERT_SUBVECTOR(UNDEF,SUB0,0),SUB1,N) pattern
Currently limited to v8i64/v8f64 cases as only AVX512 has decent cross lane 2-input shuffles, the plan is to relax this as I deal with some regressions
When I created KnownBits::absdiff, I totally missed that we already have ISD::ABDS/ABDU nodes, and we use this term in other places/targets as well.
I've added the KnownBits::abds implementation and renamed KnownBits::absdiff to KnownBits::abdu.
Followup to #84791
This is undoing a middle-end transform which does the opposite. Since
X86 doesn't have unsigned vector comparison instructions pre-AVX512,
the simplified form gets worse codegen.
Fixes#66479
Proofs: https://alive2.llvm.org/ce/z/UCz3wtCloses#84104Closes#66479
Don't just return the known zero upperbits, compute the absdiff Knownbits and perform the horizontal sum.
Add implementations that handle both the X86ISD::PSADBW nodes and the INTRINSIC_WO_CHAIN intrinsics (pre-legalization).
Changes in Recommit:
1) Fix non-determanism by using `SmallMapVector` instead of
`SmallPtrSet`.
2) Fix bug in dependency pruning where we discounted the actual
`and/or` combining the two conditions. This lead to over pruning.
Closes#81689
If we're logical shifting an all-signbits value, then we can just mask out the shifted bits.
This helps removes some unnecessary bitcasted vXi16 shifts used for vXi8 shifts (which SimplifyDemandedBits will struggle to remove through the bitcast), and allows some AVX1 shifts of 256-bit values to stay as a YMM instruction.
Noticed in codegen from #82290
Without PCMPGTQ, if the i64 elements are sign-extended enough to be representable as i32 then we can compare the lower i32 bits with PCMPGTD and splat the results into the upper elements.
Value tracking has meant we already get pretty close with this, but this allows us to remove a lot of unnecessary bit flipping.
If only one subvector extraction will be necessary (i.e. because the other is constant etc.) then extract the source operands and perform as a 128-bit comparison
Ideally DAGCombiner's narrowExtractedVectorBinOp would handle this but its tricky to confirm when a target opcode can be safely extracted and performed as a different vector type
Partially improves an outstanding regression in #82290
We were scalarizing these truncations, but in most cases we can widen the source vector to 128-bits and perform the truncation as a shuffle directly (which will usually lower as a PACK or PSHUFB).
For the cases where the widening and shuffle isn't legal we can leave it to generic legalization to scalarize for us.
Fixes#81883
Use the 3 or 4 active bits as a shift amount into a i32/i64 constant representing the number of set bits.
In future, it might be worthwhile to move this into a generic location in case other targets want to make use of them.
Another expansion pulled from #79823
When PSHUFB is used as a LUT (for CTPOP, BITREVERSE etc.), its the source operand that is constant and the index operand the variable. As long as the indices don't set the MSB (which zeros the output element), then the common known bits from the source operand can be used directly, even though the shuffle mask isn't constant.
Further helps to improve CTPOP reduction codegen
If the vXi8 add(X,Y) is guaranteed not to overflow then we can push the addition though the psadbw nodes (being used for reduction) and only need a single psadbw node.
Noticed while working on CTPOP reduction codegen
https://github.com/llvm/llvm-project/commit/1d27669e8ad07f8f2 add
support for fold 512-bit concat(blendi(x,y,c0),blendi(z,w,c1)) to
AVX512BW mask select.
But when the type of subvector is v16i16, we need to generate repeated
mask to make the result correct.
The subnode looks like t87: v16i16 = X86ISD::BLENDI t132, t58,
TargetConstant:i8<-86>.