This is especially helpful for AArch64, which simplifies ands + cmp to tst.
Alive2: https://alive2.llvm.org/ce/z/LLgcJJ
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Always let SimplifyDemandedVectorElts fold either side of a
VECTOR_SHUFFLE to UNDEF if no elements are demanded from that side.
For a single use this could be done by SimplifyDemandedVectorElts
already, but in case the operand had multiple uses we did not eliminate
the use.
Work towards making RuntimeLibcalls the centralized location for
all libcall information. This requires changing the encoding from
tracking the ISD::CondCode to using CmpInst::Predicate.
This saves 2 instructions in the ARM soft float case for fcmp ueq.
This code is written in a confusingly, overly general way. The point
of getCmpLibcallCC is to express that the compiler-rt implementations
of the FP compares are different aliases around functions which may
return -1 in some cases. This does not apply to the call for unordered,
which returns a normal boolean.
Also stop overriding the default value for the unordered compare for ARM.
This was setting it to the same value as the default, which is now assumed.
We have recently added the partial_reduce_smla and partial_reduce_umla
nodes to represent Acc += ext(a) * ext(b), where the two extends must
have the same source type and the same extend kind.
For riscv64 w/zvqdotq, we have the vqdot and vqdotu instructions which
correspond to the existing nodes, but we also have vqdotsu which
represents the case where the two extends are sign and zero respectively
(i.e. not the same type of extend).
This patch adds a partial_reduce_sumla node which has sign extension for
A, and zero extension for B. The addition is somewhat mechanical.
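
As a rough scalar model of the mixed-extension accumulate described above,
the sketch below sign-extends A, zero-extends B, and adds each group of four
products into one 32-bit lane. The function name `partialReduceSUMLA` and the
grouping of four adjacent elements per lane are assumptions made for
illustration; they are not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of Acc += sext(A) * zext(B).
// Assumption: four adjacent i8 products feed each i32 accumulator lane.
std::vector<int32_t> partialReduceSUMLA(std::vector<int32_t> acc,
                                        const std::vector<int8_t> &a,
                                        const std::vector<uint8_t> &b) {
  assert(a.size() == b.size() && a.size() == acc.size() * 4);
  for (std::size_t i = 0; i < a.size(); ++i) {
    const int32_t lhs = static_cast<int32_t>(a[i]); // sign extension
    const int32_t rhs = static_cast<int32_t>(b[i]); // zero extension
    acc[i / 4] += lhs * rhs;
  }
  return acc;
}
```

With the same bit pattern 0xFF in both inputs, A contributes -1 while B
contributes 255, which is the case the existing same-extend nodes cannot
express.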
This reverts commit 58cc1675ec7b4aa5bc2dab56180cb7af1b23ade5.
I also made the incorrect assumption that we know both values are
+/-0.0 here as well. Revert for now.
When ordering signed zero, only check the sign of one of the values. We
already know at this point that both values must be +/-0.0, so it is
sufficient to check one of them to correctly order them.
For example, for fmaximum, if we know LHS is `+0.0` then we can always
select LHS; the value of RHS does not matter. If LHS is `-0.0` we can
always select RHS; again, the value of RHS does not matter.
FMAXIMUM is currently legalized via IS_FPCLASS for the signed zero
handling. This is problematic, because it assumes the equivalent integer
type is legal. Many targets have legal fp128, but illegal i128, so this
results in legalization failures.
Fix this by replacing IS_FPCLASS with checking the bitcast to integer
instead. In that case it is sufficient to use any legal integer type, as
we're just interested in the sign bit. This can be obtained via a stack
temporary cast. There is existing FloatSignAsInt functionality used for
legalization of FABS and similar we can use for this purpose.
Fixes https://github.com/llvm/llvm-project/issues/139380.
Fixes https://github.com/llvm/llvm-project/issues/139381.
Fixes https://github.com/llvm/llvm-project/issues/140445.
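
The idea of reading the sign from an integer view of the value, rather than
classifying the float, can be sketched in plain C++. This is only an
illustration on `double`, not the legalizer code: `signBitViaInteger` and
`fmaximumSketch` are hypothetical names, and the sketch checks explicitly
that both operands are zero before it consults a single sign bit.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Read the sign bit through an integer copy of the value.
// Assumption: double is a 64-bit IEEE-754 type on this platform.
bool signBitViaInteger(double value) {
  static_assert(sizeof(double) == sizeof(uint64_t), "64-bit double expected");
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return (bits >> 63) != 0;
}

// Illustrative fmaximum: NaN propagates and +0.0 orders above -0.0.
double fmaximumSketch(double lhs, double rhs) {
  if (std::isnan(lhs) || std::isnan(rhs))
    return std::numeric_limits<double>::quiet_NaN();
  if (lhs == 0.0 && rhs == 0.0)
    // Both are +/-0.0, so the sign of one operand decides the result.
    return signBitViaInteger(lhs) ? rhs : lhs;
  return lhs > rhs ? lhs : rhs;
}
```

Only the sign bit is inspected, which is why any integer type that contains
that bit is enough.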
Added APInt::clearBits(unsigned loBit, unsigned hiBit) that clears bits within a certain range.
Fixes #136550
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
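
To illustrate what clearing a bit range means, here is a minimal sketch on a
plain `uint64_t` instead of APInt. It assumes the range is half-open,
[loBit, hiBit), by analogy with how `setBits` is commonly used; that
convention and the name `clearBitRange` are assumptions, not taken from the
patch.

```cpp
#include <cassert>
#include <cstdint>

// Clear bits [loBit, hiBit) of a 64-bit value (half-open range assumed).
uint64_t clearBitRange(uint64_t value, unsigned loBit, unsigned hiBit) {
  assert(loBit <= hiBit && hiBit <= 64 && "invalid bit range");
  if (loBit == hiBit)
    return value; // empty range, nothing to clear
  const unsigned width = hiBit - loBit;
  // Avoid shifting by 64, which would be undefined behaviour.
  const uint64_t mask =
      (width == 64) ? ~uint64_t{0} : ((uint64_t{1} << width) - 1) << loBit;
  return value & ~mask;
}
```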
Proposed by
[2ed1598](2ed15984b4):
`fshl X, (or X, Y), C ==/!= 0 --> or (srl Y, BW-C), X ==/!= 0`
This transformation is valid when (C%Bitwidth) != 0, as verified by
[Alive2](https://alive2.llvm.org/ce/z/TQYM-m).
Fixes #136746
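
The fold can also be checked exhaustively at a small bit width. The sketch
below models an 8-bit funnel shift left and compares the zero test of both
forms for every X, Y, and every shift amount from 1 to 7. The helper names
are hypothetical, and the check complements the Alive2 proof rather than
replacing it.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned BW = 8; // bit width used for the exhaustive check

// fshl: concatenate hi:lo, shift left by amount, keep the top BW bits.
uint8_t fshl8(uint8_t hi, uint8_t lo, unsigned amount) {
  amount %= BW;
  if (amount == 0)
    return hi;
  return static_cast<uint8_t>((hi << amount) | (lo >> (BW - amount)));
}

// Left-hand side: fshl X, (or X, Y), C == 0
bool originalIsZero(uint8_t x, uint8_t y, unsigned c) {
  return fshl8(x, static_cast<uint8_t>(x | y), c) == 0;
}

// Right-hand side: or (srl Y, BW-C), X == 0
bool foldedIsZero(uint8_t x, uint8_t y, unsigned c) {
  assert(c % BW != 0 && "fold requires a non-zero effective shift amount");
  return static_cast<uint8_t>((y >> (BW - c % BW)) | x) == 0;
}

bool foldHoldsForAllInputs() {
  for (unsigned c = 1; c < BW; ++c)
    for (unsigned x = 0; x < 256; ++x)
      for (unsigned y = 0; y < 256; ++y)
        if (originalIsZero(static_cast<uint8_t>(x), static_cast<uint8_t>(y), c) !=
            foldedIsZero(static_cast<uint8_t>(x), static_cast<uint8_t>(y), c))
          return false;
  return true;
}
```

The intuition: any set bit of X survives either in `X << C` or in
`X >> (BW-C)`, so both sides are non-zero whenever X is; when X is zero, both
sides reduce to `Y >> (BW-C)`.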
Based off feedback for #129695 - we need to be able to determine the
load offset of smaller loads when trying to determine whether a multiple
use load should be split (in particular for AVX subvector extractions).
This patch adds a std::optional<unsigned> ByteOffset argument to
shouldReduceLoadWidth calls for where we know the constant offset to
allow targets to make use of it in future patches.
This is a reland of #99752 with the bug fixed (see test diff in the
third commit in this PR).
All `popcount` libcalls return `int`, but `ISD::CTPOP` returns the type
of the argument, which can be wider than `int`. The fix is to make DAG
legalizer pass the correct return type to `makeLibCall` and sign-extend
the result afterwards.
Original commit message:
The main change is adding CTPOP to `RuntimeLibcalls.def` to allow
targets to use LibCall action for CTPOP. DAG legalizers are changed
accordingly.
Pull Request: https://github.com/llvm/llvm-project/pull/101786
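
The type mismatch behind the fix can be modelled in a few lines. In this
sketch `popcountLibcallModel` is a hypothetical stand-in for a popcount
libcall (it returns `int`, like the real ones), and `ctpop64` widens that
result to the operand type, mirroring the "correct return type, then extend"
step described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a popcount libcall: the result type is int,
// regardless of how wide the argument is.
int popcountLibcallModel(uint64_t value) {
  int count = 0;
  while (value != 0) {
    value &= value - 1; // clear the lowest set bit
    ++count;
  }
  return count;
}

// Models a 64-bit CTPOP: the result has the same type as the operand, so
// the int returned by the call is sign-extended to 64 bits.
uint64_t ctpop64(uint64_t value) {
  const int narrow = popcountLibcallModel(value);
  return static_cast<uint64_t>(static_cast<int64_t>(narrow));
}
```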
I noticed these destructors taking time with -ftime-trace and moved some
of them for minor build efficiency improvements.
The main impact of moving destructors out of line is that container
fields no longer require their element types to be complete, i.e. one
can have uptr<T> or vector<T> as a field with an incomplete type T, and
that means we can reduce transitive includes, as with LegalizerInfo.h.
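
A minimal single-file sketch of the pattern, with hypothetical names
(`Holder`, `Impl`): the "header" part only forward-declares the element type,
and the destructor is defined later, where the type is complete.

```cpp
#include <cassert>
#include <memory>

// --- What would live in the header: Impl is only forward-declared. ---
struct Impl;

class Holder {
public:
  Holder();
  ~Holder(); // declared here, defined where Impl is complete
  int value() const;

private:
  std::unique_ptr<Impl> impl; // allowed although Impl is incomplete here
};

// --- What would live in the .cpp file: Impl is now complete. ---
struct Impl {
  int value = 42;
};

Holder::Holder() : impl(std::make_unique<Impl>()) {}
Holder::~Holder() = default; // unique_ptr<Impl>'s deleter is instantiated here
int Holder::value() const { return impl->value; }
```

If the destructor were implicit or defined inline in the class, the header
would have to include the definition of `Impl`, which is the transitive
include this change avoids.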
Move expensive getDebugOperandsForReg template out-of-line. The
std::function instantiation shows up in time trace even if you don't use
the function.
FP_ROUND and FP_EXTEND the input value before FABSing it. This avoids
some bit twiddling to copy the sign bit from the input to the result. It
does introduce one extra FABS, but that is folded into another
instruction for free on AMDGPU, which is the only target currently
affected by this change.
This adds a call to SimplifyDemandedBits from bitcasts with scalar input
types in SimplifyDemandedVectorElts, which can help simplify the input
scalar.
This reverts commit 36eaf0daf5d6dd665d7c7a9ec38ea22f27709fed.
This is not a sound approach to dealing with this instruction change.
The new behavior is a different opcode pair, not a modifier on the
existing opcode.
For targets that support IEEE fminimum_num/fmaximum_num, the
corresponding *_min_num_fXY/*_max_num_fXY instructions themselves
already did the canonicalization for the inputs. As a result, we do not
need to explicitly canonicalize the inputs for fminnum/fmaxnum.
Add signed and unsigned PARTIAL_REDUCE_MLA ISD nodes. Add command line
argument (aarch64-enable-partial-reduce-nodes) that indicates whether the
intrinsic experimental_vector_partial_reduce_add will be transformed
into the new ISD node. Lowering with the new ISD nodes will, for now,
always be done as an expand.
A BUILD_VECTOR can implicitly shrink the bits of the operands if the
operand types are not legal. For example a v8i16 constant BUILD_VECTOR
might be represented as v8i16 BUILD_VECTOR(i32 1, i32 2, ...).
Unfortunately this means that the constants are not accepted by
matchUnaryPredicateImpl, preventing in this case funnel shifts detecting
that all the operands are non-zero. Add a flag to help it match.
These functions have similar code. One of them calculates the 2x width
full product from 2 sources. The other calculates the product from 2
sources that have low and high halves.
This patch introduces a new function that takes HiLHS and HiRHS as
optional values. If they are not null, they will be used in the
calculation of the Hi half. The Signed flag can only be set when
HiLHS/HiRHS are null.
We have two forceExpandWideMUL functions. One takes the low and high
half of 2 inputs and calculates the low and high half of their product.
This does not calculate the full 2x width product.
The other signature takes 2 inputs and calculates the low and high half
of their full 2x width product. Previously it did this by sign/zero
extending the inputs to create the high bits and then calling the other
function.
We can instead copy the algorithm from the other function and use the
Signed flag to determine whether we should do SRA or SRL. This avoids
the need to multiply the high part of the inputs and add them to the
high half of the result. This improves the generated code for signed
multiplication.
This should improve the performance of #123262. I don't know yet how
close we will get to gcc.
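
The half-based algorithm can be sketched at a smaller scale: a 32x32 to 64
bit multiply built from 16-bit halves, with a flag choosing between sign and
zero extension of the high halves. This is an illustrative model with
hypothetical names (`wideMul32`, `highHalf`), not the SelectionDAG code; it
relies on arithmetic right shift of negative 64-bit values, which C++20
guarantees and mainstream compilers already provide.

```cpp
#include <cassert>
#include <cstdint>

struct WideProduct {
  uint32_t lo;
  uint32_t hi;
};

// Upper 16 bits of a 32-bit input: sign-extended when isSigned (the SRA
// case), zero-extended otherwise (the SRL case).
int64_t highHalf(uint32_t value, bool isSigned) {
  int64_t half = value >> 16;
  if (isSigned && (half & 0x8000))
    half -= 0x10000;
  return half;
}

WideProduct wideMul32(uint32_t lhs, uint32_t rhs, bool isSigned) {
  const int64_t u0 = lhs & 0xFFFFu, v0 = rhs & 0xFFFFu;
  const int64_t u1 = highHalf(lhs, isSigned), v1 = highHalf(rhs, isSigned);

  const int64_t w0 = u0 * v0;              // low x low
  const int64_t t = u1 * v0 + (w0 >> 16);  // first cross term plus carry
  const int64_t w1 = u0 * v1 + (t & 0xFFFF);
  const int64_t w2 = t >> 16;              // arithmetic shift keeps the sign
  const int64_t hi = u1 * v1 + w2 + (w1 >> 16);
  return {lhs * rhs, static_cast<uint32_t>(hi)}; // low half wraps mod 2^32
}

// Reference result using a native 64-bit multiply.
uint64_t nativeProduct(uint32_t lhs, uint32_t rhs, bool isSigned) {
  if (isSigned)
    return static_cast<uint64_t>(
        static_cast<int64_t>(static_cast<int32_t>(lhs)) *
        static_cast<int32_t>(rhs));
  return static_cast<uint64_t>(lhs) * rhs;
}

bool matchesNative(uint32_t lhs, uint32_t rhs, bool isSigned) {
  const WideProduct p = wideMul32(lhs, rhs, isSigned);
  return ((static_cast<uint64_t>(p.hi) << 32) | p.lo) ==
         nativeProduct(lhs, rhs, isSigned);
}
```

No separate multiply of sign-extension words is needed: the signedness only
changes how the high halves and the carries are extended.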
Based on feedback from the clastb codegen PR, I'm refactoring basic codegen for the vector.extract.last.active intrinsic to lower to an ISD node in SelectionDAGBuilder then expand in LegalizeVectorOps, instead of doing everything in the builder.
The new ISD node (vector_find_last_active) only covers finding the index of the last active element of the mask, and extracting the element + handling passthru is left to existing ISD nodes.
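
A scalar sketch of the split described above: one helper only finds the index
of the last active mask lane, and a second one performs the extract and the
passthru handling. The names and the use of -1 to signal "no active lane" are
assumptions for illustration; the behaviour of the real node for an
all-false mask is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the last active (true) lane, or -1 if no lane is active.
// The -1 sentinel is an assumption of this sketch.
long findLastActive(const std::vector<bool> &mask) {
  for (std::size_t i = mask.size(); i-- > 0;)
    if (mask[i])
      return static_cast<long>(i);
  return -1;
}

// Extract the element at the last active lane, or passthru if none is active.
int extractLastActive(const std::vector<int> &data,
                      const std::vector<bool> &mask, int passthru) {
  assert(data.size() == mask.size());
  const long index = findLastActive(mask);
  return index >= 0 ? data[static_cast<std::size_t>(index)] : passthru;
}
```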
I think SDNodeIterator primarily exists because GraphTraits requires an
iterator that dereferences to SDNode*. op_iterator dereferences to
SDUse* which is implicitly convertible to SDValue.
This piece of code can use SDValue instead of SDNode* so we should
prefer to use the more common op_iterator.
Add DAG legalization support for expanding i1 SETCC nodes using
appropriate logical operations to simulate integer comparisons. Use
these expansions to handle i1 SETCC in NVPTX.
Fixes #58428 and #57405
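
The kind of expansion meant here can be sketched with the standard boolean
identities for one-bit comparisons. This is a model, not the patch itself:
the enum and function names are hypothetical, and the patch's exact choice of
operations may differ. Note that a signed i1 `true` is -1, so the signed
orderings are the mirror image of the unsigned ones.

```cpp
#include <cassert>

enum class Cond { EQ, NE, ULT, ULE, UGT, UGE, SLT, SLE, SGT, SGE };

// Expand a one-bit comparison into logical operations only.
bool expandI1SetCC(bool a, bool b, Cond cc) {
  switch (cc) {
  case Cond::EQ:  return !(a != b); // NOT (a XOR b)
  case Cond::NE:  return a != b;    // a XOR b
  case Cond::ULT: return !a && b;   // only 0 < 1
  case Cond::ULE: return !a || b;
  case Cond::UGT: return a && !b;   // only 1 > 0
  case Cond::UGE: return a || !b;
  case Cond::SLT: return a && !b;   // only -1 < 0
  case Cond::SLE: return a || !b;
  case Cond::SGT: return !a && b;   // only 0 > -1
  case Cond::SGE: return !a || b;
  }
  return false;
}

// Reference: compare the values as real integers (unsigned 0/1, signed 0/-1).
bool referenceCompare(bool a, bool b, Cond cc) {
  const int ua = a ? 1 : 0, ub = b ? 1 : 0;
  const int sa = a ? -1 : 0, sb = b ? -1 : 0;
  switch (cc) {
  case Cond::EQ:  return ua == ub;
  case Cond::NE:  return ua != ub;
  case Cond::ULT: return ua < ub;
  case Cond::ULE: return ua <= ub;
  case Cond::UGT: return ua > ub;
  case Cond::UGE: return ua >= ub;
  case Cond::SLT: return sa < sb;
  case Cond::SLE: return sa <= sb;
  case Cond::SGT: return sa > sb;
  case Cond::SGE: return sa >= sb;
  }
  return false;
}

bool expansionMatchesReference() {
  const Cond all[] = {Cond::EQ,  Cond::NE,  Cond::ULT, Cond::ULE, Cond::UGT,
                      Cond::UGE, Cond::SLT, Cond::SLE, Cond::SGT, Cond::SGE};
  for (Cond cc : all)
    for (int a = 0; a < 2; ++a)
      for (int b = 0; b < 2; ++b)
        if (expandI1SetCC(a != 0, b != 0, cc) !=
            referenceCompare(a != 0, b != 0, cc))
          return false;
  return true;
}
```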
I want to use this function for GISel too so Type * is a better common
interface. All of the callers already convert EVT to Type * as needed
by calling lowering anyway.
This was overlooked in 7d940432c46be83b8fcb5dbefee439585fa820cd - when
inline assembly has multiple outputs, they are returned as members of a
struct, and the `getAsmOperandType` needs to be called for each member
of struct. The difference between this and the single-output case is
that in the latter, there isn't a struct wrapping the outputs.
I noticed this when trying to use the same mechanism in the RISC-V
backend.
Committing two tests:
- One that shows a crash before this change, which is fixed by this
change.
- One (commented out) that shows a different crash with tied
inputs/outputs. This is commented out as it is not fixed by this change
and needs more work in target-independent inline asm handling code.