There are target intrinsics that logically require two MMOs, such as
llvm.amdgcn.global.load.lds, which is a copy from global memory to LDS,
so there's both a load and a store to different addresses.
Add an overload of getTgtMemIntrinsic that produces intrinsic info in a
vector, and implement it in terms of the existing (now protected)
overload.
GlobalISel and SelectionDAG paths are updated to support multiple MMOs.
The main part of this change is supporting multiple MMOs in
MemIntrinsicNodes.
Converting the backends to use the new overload is a fairly mechanical step
that is done in a separate change, in the hope that this reduces merge
pain during review and for downstreams. A later change will then enable
using multiple MMOs in AMDGPU.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strcmp routine is a millicode implementation;
we use millicode for the strcmp function instead of a library call to
improve performance.
TLI.isBinOp recognises some opcodes that have multiple results,
including UADDO etc.
In most cases we currently just bail if a binop has multiple results,
but shuffle combining was missing the check and it's pretty trivial to
add handling in this case.
I've added add/sub-overflow opcodes to verifyNode to help catch these
cases in the future - IIRC there was a plan to autogen these, but there
isn't anything at the moment.
Fixes #179112
This patch extends `BuildVectorSDNode::isConstantSequence` to recognize
constant sequences that contain undef elements at any position.
The new implementation finds the first two non-undef constant elements,
computes the stride from their difference, then verifies all other
defined elements match the sequence. This enables SVE's INDEX
instruction to be used in more cases.
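The scan described above can be sketched in a few lines. This is an illustrative Python model, not the LLVM code: `constant_sequence` is a hypothetical name, `None` stands in for an undef element, and returning nothing when fewer than two elements are defined is an assumption.

```python
from typing import Optional, Sequence, Tuple


def constant_sequence(elts: Sequence[Optional[int]]) -> Optional[Tuple[int, int]]:
    """Return (start, stride) if every defined element equals start + i * stride."""
    # Keep only the defined (non-undef) elements together with their positions.
    defined = [(i, v) for i, v in enumerate(elts) if v is not None]
    if len(defined) < 2:
        return None  # assumption: too little information to infer a stride

    # The first two defined elements fix the stride.
    (i0, v0), (i1, v1) = defined[0], defined[1]
    diff, gap = v1 - v0, i1 - i0
    if diff % gap != 0:
        return None  # no integer stride fits these two elements
    stride = diff // gap
    start = v0 - i0 * stride

    # Every remaining defined element must lie on the same sequence.
    for i, v in defined[2:]:
        if v != start + i * stride:
            return None
    return start, stride
```

Both the trailing-undef form from the ZIP1 case and the leading-undef form from the ZIP2 case resolve to the same `(start, stride)` pair in this model.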
This change particularly benefits ZIP1/ZIP2 patterns where one operand
is a constant sequence. When a smaller constant vector like `<0, 1, 2,
3>` is used in a ZIP1 shuffle producing a wider result, it gets expanded
with trailing undefs. Similarly, for ZIP2 patterns, the DAG combiner
transforms the constant to have leading undefs since ZIP2 only uses the
upper half of its operands.
In particular, these patterns arise naturally from `VectorCombine`'s
`compactShuffleOperands` optimization (see #176074) that I am suggesting
as a fix for #137447.
This change improves memset code generation for non-zero values on
AArch64 by using NEON's DUP instruction instead of
the less efficient pattern of multiplying by 0x01010101.
For small sizes, the value is extracted from a larger DUP. For
non-power-of-two sizes, overlapping stores are used in some cases.
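For reference, the scalar splat pattern being replaced and the overlapping-store idea can be modeled as follows. This is a Python sketch with hypothetical helper names; the store-width choice shown (largest power of two not exceeding the size) is an assumption, not the heuristic used in the patch.

```python
from typing import List, Tuple


def splat_byte_mul(value: int, width_bytes: int) -> int:
    """Replicate the low byte of value across width_bytes by multiplying with 0x01...01."""
    ones = int.from_bytes(b"\x01" * width_bytes, "little")
    return ((value & 0xFF) * ones) & ((1 << (8 * width_bytes)) - 1)


def overlapping_stores(size: int) -> List[Tuple[int, int]]:
    """Cover [0, size) with at most two equal-width stores, returned as (offset, width)."""
    if size < 1:
        raise ValueError("size must be positive")
    width = 1 << (size.bit_length() - 1)  # largest power of two <= size
    if width == size:
        return [(0, width)]
    # The second store is shifted back so that it ends exactly at size,
    # overlapping the first store instead of using several narrower stores.
    return [(0, width), (size - width, width)]
```

For a 12-byte memset this model yields two 8-byte stores at offsets 0 and 4, which overlap in bytes 4 to 7.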
TargetLowering::findOptimalMemOpLowering is modified to allow explicitly
specifying the size of the constant in cases where the constant is
larger than the store operations.
Fixes #165949
The previous getMemcpyLoadsAndStores implementation would chain
load/store instructions from "NumLdStInMemcpy - GlueIter -
GluedLdStLimit" to "NumLdStInMemcpy - GlueIter". This approach caused
issues when copying non-power-of-two sizes, as it would chain leading
load/stores with subsequent instructions at non-power-of-two aligned
offsets.
This chaining pattern prevented optimal optimizations in
aarch64-ldst-opt pass for these load/store instructions.
This commit modifies the chaining range to be from GlueIter to GlueIter
+ GluedLdStLimit, enabling proper optimization of load/store
instructions in aarch64-ldst-opt.
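The difference between the two chaining ranges can be seen in a small model. This Python sketch only computes the index ranges; the function names are hypothetical, and the assumptions are that GlueIter advances by GluedLdStLimit per group and that leftover operations form one final group.

```python
from typing import List, Tuple


def old_chain_ranges(num_ldst: int, limit: int) -> List[Tuple[int, int]]:
    """Half-open index ranges chained by the previous scheme (counted from the end)."""
    ranges = []
    glue_iter = 0
    for _ in range(num_ldst // limit):
        ranges.append((num_ldst - glue_iter - limit, num_ldst - glue_iter))
        glue_iter += limit
    remaining = num_ldst % limit
    if remaining:
        ranges.append((0, remaining))  # assumption: residual ops sit at the front
    return ranges


def new_chain_ranges(num_ldst: int, limit: int) -> List[Tuple[int, int]]:
    """Half-open index ranges chained by the new scheme (counted from the start)."""
    ranges = []
    glue_iter = 0
    for _ in range(num_ldst // limit):
        ranges.append((glue_iter, glue_iter + limit))
        glue_iter += limit
    if glue_iter < num_ldst:
        ranges.append((glue_iter, num_ldst))  # assumption: residual ops sit at the end
    return ranges
```

With five operations and a limit of two, the old ranges group operations 3-4 and 1-2 and leave operation 0 alone, whereas the new ranges group 0-1 and 2-3 and leave operation 4 alone.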
Closes https://github.com/llvm/llvm-project/issues/165947
This assert should not have existed, because just below it the code
bails out for that same condition. The case of the vector being a
scalable vector also shouldn't cause the compiler to crash with an
assertion failure, and instead it should just avoid analysing the
expression.
1. Implement `SelectionDAG::computeKnownBits` for TRUNCATE_SSAT_S/U and
TRUNCATE_USAT_U
2. Saturating truncation operations are well-defined for all inputs and
cannot create poison or undef values. This allows the optimizer to
eliminate unnecessary freeze instructions after these operations.
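To illustrate why these operations are total, here is a small Python model of the three nodes. It assumes the usual reading of the node names (signedness of the input, then signedness of the saturated result); the function names are hypothetical and the model works on mathematical integers rather than bit patterns.

```python
def trunc_ssat_s(x: int, dst_bits: int) -> int:
    """Signed input clamped to the signed range of dst_bits."""
    lo, hi = -(1 << (dst_bits - 1)), (1 << (dst_bits - 1)) - 1
    return max(lo, min(hi, x))


def trunc_ssat_u(x: int, dst_bits: int) -> int:
    """Signed input clamped to the unsigned range of dst_bits."""
    return max(0, min((1 << dst_bits) - 1, x))


def trunc_usat_u(x: int, dst_bits: int) -> int:
    """Unsigned input clamped to the unsigned range of dst_bits."""
    if x < 0:
        raise ValueError("unsigned input expected")
    return min((1 << dst_bits) - 1, x)
```

Every in-range input maps to exactly one in-range result, so there is no input for which the result would be unspecified.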
Fixes #152143
Extend the existing DAGCombine logic in visitIMINMAX so that signed and
unsigned MIN/MAX can be flipped not only when both operands are known
non-negative but also when both operands are known negative. This
replaces the old SignBitIsZero checks with computeKnownBits and explicit
tests for non-negative or negative operands while keeping all existing
legality and saturation gating in place. Add regression tests to cover
both the known-negative case and the known-non-negative case.
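The underlying fact is that signed and unsigned ordering agree whenever both operands have the same sign bit. A brute-force check over all 8-bit pairs, written as a Python sketch with hypothetical helper names, shows this for both the non-negative and the negative case:

```python
def to_signed(x: int, bits: int = 8) -> int:
    """Interpret an unsigned bit pattern as a two's complement value."""
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def same_sign(a: int, b: int, bits: int = 8) -> bool:
    """True if both bit patterns have the same sign bit."""
    return ((a ^ b) >> (bits - 1)) & 1 == 0


def flip_holds_for_same_sign(bits: int = 8) -> bool:
    """Check umin == smin and umax == smax for all pairs with equal sign bits."""
    for a in range(1 << bits):
        for b in range(1 << bits):
            if not same_sign(a, b, bits):
                continue
            smin = min(a, b, key=lambda v: to_signed(v, bits))
            smax = max(a, b, key=lambda v: to_signed(v, bits))
            if min(a, b) != smin or max(a, b) != smax:
                return False
    return True
```

With mixed signs the flip is not valid: for the patterns 0x01 and 0xFF the unsigned minimum is 0x01, while the signed minimum is 0xFF (that is, -1).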
Fixes #174325
Libcall lowering decisions should come from the LibcallLoweringInfo
analysis. Query this through the DAG, so eventually the source
can be the analysis. For the moment this is just a wrapper around
the TargetLowering information.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strstr routine is a millicode implementation;
we use millicode for the strstr function instead of a library call to
improve performance.
I add a helper function `getRuntimeCallSDValueHelper` in the patch. I
will refactor the functions `SelectionDAG::getStrlen`,
`SelectionDAG::getStrcpy`, etc. later in another patch.
Add handling for CTLS using the same method as in
https://github.com/llvm/llvm-project/pull/174636.
Added tests for AArch64 and RISCV, but it seems that ARM is actually
resolving `llvm.arm.cls` to `clz`, so no tests were added there.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strcpy routine is a millicode implementation;
we use millicode for the strcpy function instead of a library call to
improve performance.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Range metadata was handled in a ISD::LOAD case in the main opcode
switch. Extending loads and constant pools were handled with special
code after the main switch. Move this code into the ISD::LOAD case of
the main switch.
There is one slight change here: I put the Op.getResNo() == 0 check
before the range handling. This should be more correct.
This PR implements the first change outlined in
https://discourse.llvm.org/t/rfc-allow-non-constant-offsets-in-llvm-vector-splice/88974?u=lukel
In order to allow non-immediate offsets in the llvm.vector.splice
intrinsic, we need to separate out the "shift left" and "shift right"
modes into two separate intrinsics. Previously the mode was determined by
the sign of the offset.
The description in the LangRef has also been reworded in terms of
sliding elements left or right and extracting either the upper or lower
half as opposed to extracting from a certain index, which brings it
in line with the definition of `llvm.fshr.*`/`llvm.fshl.*`.
This patch teaches AutoUpgrade.cpp to upgrade the old intrinsic into
its new equivalent based on the sign of its offset, so existing uses of
vector.splice should still work.
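The two modes and the upgrade rule can be modeled on fixed-length lists. This is a Python sketch under assumptions: the function names are hypothetical stand-ins for the new intrinsics, element 0 is treated as the leftmost element, and scalable vectors are not modeled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def splice_left(a: Sequence[T], b: Sequence[T], offset: int) -> List[T]:
    """Concatenate a and b, slide left by offset elements, keep the first len(a)."""
    n = len(a)
    assert len(b) == n and 0 <= offset <= n
    concat = list(a) + list(b)
    return concat[offset:offset + n]


def splice_right(a: Sequence[T], b: Sequence[T], offset: int) -> List[T]:
    """Concatenate a and b, slide right by offset elements, keep the last len(a)."""
    n = len(a)
    assert len(b) == n and 0 <= offset <= n
    concat = list(a) + list(b)
    return concat[n - offset:2 * n - offset]


def upgrade_old_splice(a: Sequence[T], b: Sequence[T], imm: int) -> List[T]:
    """Model of the upgrade: the sign of the old immediate selects the mode."""
    if imm >= 0:
        return splice_left(a, b, imm)
    return splice_right(a, b, -imm)
```

In this model an old splice of `<A,B,C,D>` and `<E,F,G,H>` with offset 1 and one with offset -3 both produce `<B,C,D,E>`, the first through the left mode and the second through the right mode.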
Uses of llvm.vector.splice in `llvm/test/CodeGen` haven't been replaced
in this PR to keep the diff small and kick the tyres on the AutoUpgrader
a bit. I planned to do this in a follow-up NFC but can include it in
this PR if reviewers prefer.
Similarly the shuffle costing kind `SK_Splice` has just been kept the
same for now, to be split into `SK_SpliceLeft` and `SK_SpliceRight`
later.
In line with a std proposal, introduce the llvm.clmul family of
intrinsics corresponding to carry-less multiply operations. This work
builds upon 727ee7e ([APInt] Introduce carry-less multiply primitives),
and follow-up patches will introduce custom-lowering on supported
targets, replacing target-specific clmul intrinsics.
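For readers unfamiliar with the operation, a carry-less multiply is a shift-and-XOR product, i.e. polynomial multiplication over GF(2). The Python sketch below is a reference model only; the function name and the choice to return the low `bits` bits of the product are assumptions for illustration.

```python
def clmul(a: int, b: int, bits: int = 64) -> int:
    """Carry-less multiply of two bits-wide values, returning the low bits of the product."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # XOR instead of add: no carries propagate
        b >>= 1
        shift += 1
    return result & mask
```

For example, 0b101 times 0b11 gives 0b1111 here, the same as an ordinary multiply, while 0b11 times 0b11 gives 0b101 because the two middle partial-product bits cancel instead of carrying.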
Testing is done on the RISC-V target, which should be sufficient to
prove that the intrinsics work, since no RISC-V specific lowering has
been added.
Ref: https://isocpp.org/files/papers/P3642R3.html
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Treat these like other shift operations by allowing the shift amount to
be a different type than the result.
The PromoteIntOp_Shift and LegalizeDAG code are not tested due to lack
of target support.
I'm looking at adding SSHLSAT for the RISC-V P extension. I don't need
this support for that since RISC-V only has one legal type. I just thought it
was odd that they weren't like other shifts.
If the sign bit of the denominator is known 0, do not emit the fabs.
Also, extend this to handle min/max with fabs inputs.
I originally tried to do this as the general combine on fabs, but
it proved to be too much trouble at this time. This is mostly
complexity introduced by expanding the various min/maxes into
canonicalizes, and then not being able to assume the sign bit
of canonicalize (fabs x) without nnan.
This defends against future code size regressions in the atan2 and
atan2pi library functions.
Some floating-point optimizations don't trigger because they can produce
incorrect results around signed zeros, and rely on the existence of the
nsz flag which commonly appears when fast-math is enabled.
However, this flag is not a hard requirement when all of the users of
the combined value are either guaranteed to overwrite the sign-bit or
simply ignore it (comparisons, etc.).
The optimizations affected:
- fadd x, +0.0 -> x
- fsub x, -0.0 -> x
- fsub +0.0, x -> fneg x
- fdiv(x, sqrt(x)) -> sqrt(x)
- frem lowering with power-of-2 divisors
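The signed-zero hazard behind the first three folds can be reproduced with ordinary IEEE 754 doubles. This Python sketch (helper name hypothetical) shows that each fold changes only the sign of a zero result, which a comparison cannot observe:

```python
import math


def is_negative_zero(x: float) -> bool:
    """True only for -0.0."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0


x = -0.0
pos_zero = 0.0

# fadd x, +0.0 -> x: the add yields +0.0, but x is -0.0.
fadd_result = x + pos_zero

# fsub x, -0.0 -> x: the subtract yields +0.0, but x is -0.0.
fsub_result = x - (-pos_zero)

# fsub +0.0, x -> fneg x: with x = +0.0 the subtract yields +0.0, fneg yields -0.0.
fsub_from_zero = pos_zero - pos_zero
fneg_result = -pos_zero

# A user that only compares the value cannot tell the two zeros apart.
comparison_ignores_sign = (fadd_result == x)
```

Here `fadd_result` and `fsub_result` are positive zeros although `x` is negative zero, yet `comparison_ignores_sign` is true, which is the situation where the fold is safe without nsz.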
If the upper bits are zero, but we expand the multiply and then immediately
convert it into a libcall, there is no opportunity to optimize
away the mul. Do so in getNode to make sure extending multiplies
optimise cleanly.
This PR implements catch-all handling for widening the scalable
subvector operand (INSERT_SUBVECTOR) or result (EXTRACT_SUBVECTOR). It
does this via the stack using masked memory operations. With general
handling available we can add optimizations for specific cases.
Similar to how getElementCount avoids the need to reason about fixed and
scalable ElementCounts separately, this patch adds getTypeSize to do the
same for TypeSize.
It also goes through and replaces some of the manual uses of getVScale
with getTypeSize/getElementCount where possible.
If vector-unaligned-mem support is not enabled, we should not generate
loads/stores that are not aligned to their element size.
We already do this for non-VP vector loads/stores.
This code has been in our downstream for about a year and a half after
finding the vectorizer generating misaligned loads/stores. I don't think
that is unique to our downstream.
Doing this for masked vp.load/store requires widening the mask as well
which is harder to do.
NOTE: Because we have to scale the VL, this will introduce additional
vsetvli and the VL optimizer will not be effective at optimizing any
arithmetic that is consumed by the store.
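The VL scaling mentioned in the note can be pictured with a small model. This Python sketch assumes that an under-aligned access is rewritten to use byte-sized elements, so the element count has to grow by the original element size; the function name is hypothetical and the actual lowering may differ in detail.

```python
from typing import Tuple


def legalize_vp_access(elt_bytes: int, vl: int, align_bytes: int) -> Tuple[int, int]:
    """Return (element size in bytes, VL) to use for an unmasked VP load/store."""
    if align_bytes >= elt_bytes:
        return elt_bytes, vl  # already aligned to the element size
    # Assumption: fall back to byte elements, which are always aligned,
    # and scale VL so that the same number of bytes is accessed.
    return 1, vl * elt_bytes
```

A store of 10 four-byte elements with alignment 1 becomes a store of 40 one-byte elements in this model, which is why an extra vsetvli is needed for the scaled VL.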