The llvm.cond.loop intrinsic is semantically equivalent to a conditional
branch conditioned on ``pred`` to a basic block consisting only of an
unconditional branch to itself. Unlike such a branch, it is guaranteed
to use specific instructions. This allows an interrupt handler or
other introspection mechanism to straightforwardly detect whether
the program is currently spinning in the infinite loop and possibly
terminate the program if so. The intent is that this intrinsic may
be used as a more efficient alternative to a conditional branch to
a call to ``llvm.trap`` in circumstances where the loop detection
is guaranteed to be present. This construct has been experimentally
determined to be executed more efficiently (when the branch is not taken)
than a conditional branch to a trap instruction on AMD and older Intel
microarchitectures, and is also more code size efficient by avoiding the
need to emit a trap instruction and possibly a long branch instruction.
On i386 and x86_64, the infinite loop is guaranteed to consist of a short
conditional branch instruction that branches to itself. Specifically,
the first byte of the instruction will be between 0x70 and 0x7F, and
the second byte will be 0xFE.
Part of this RFC:
https://discourse.llvm.org/t/rfc-optimizing-conditional-traps/89456
Reviewers: arsenm, RKSimon, fmayer, vitalybuka
Pull Request: https://github.com/llvm/llvm-project/pull/177686
There are target intrinsics that logically require two MMOs, such as
llvm.amdgcn.global.load.lds, which is a copy from global memory to LDS,
so there's both a load and a store to different addresses.
Add an overload of getTgtMemIntrinsic that produces intrinsic info in a
vector, and implement it in terms of the existing (now protected)
overload.
GlobalISel and SelectionDAG paths are updated to support multiple MMOs.
The main part of this change is supporting multiple MMOs in
MemIntrinsicNodes.
Converting the backends to using the new overload is a fairly mechanical step
that is done in a separate change in the hope that that allows reducing merging
pains during review and for downstreams. A later change will then enable
using multiple MMOs in AMDGPU.
1. AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strcmp routine is a millicode implementation;
we use millicode for the strcmp function instead of a library call to
improve performance.
Remove getPrefTypeAlign calls and use only the alloca's explicit
alignment, since the type may not be semantically useful, there is no
useful reason to change alignment to support it.
The alloca's explicit alignment (from getAlign()) is already optimally
correct; we don't need to derive alignment from the allocated type.
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This is a followup to https://github.com/llvm/llvm-project/pull/171288,
which removed lowering of libcalls to SDAG nodes for most libcalls that
get unconditionally canonicalized to intrinsics. This handles the
remaining fabs case, which I originally skipped due to larger test
impact.
Following on from #170796, this PR implements the second part of
https://discourse.llvm.org/t/rfc-allow-non-constant-offsets-in-llvm-vector-splice/88974
by allowing non-constant offsets in the vector splice intrinsics.
Previously @llvm.vector.splice had a restriction enforced by the
verifier that the offset had to be known to be within the range of the
vector at compile time. Because we can't enforce this with non-constant
offsets, it's been relaxed so that offsets that would slide the vector
out of bounds return a poison value, similar to
insertelement/extractelement.
@llvm.vector.splice.left also previously only allowed offsets within the
range 0 <= Offset < N, but this has been relaxed to 0 <= Offset <= N so
that it's consistent with @llvm.vector.splice.right.
In lieu of the verifier checks that were removed, InstSimplify has been
taught to fold splices to poison when the offset is out of bounds.
The cost model isn't implemented in this PR, and just returns invalid
for any non-constant offsets for now. I think the correct way to cost
these non-constant offets isn't through getShuffleCost because they
can't handle variable masks, but instead just through
getIntrinsicInstCost.
These are long overdue for removal. These were originally a hack
to support loading half values before there was any / decent support
for the half type through the backend. There's no reason to continue
supporting these, they're equivalent to fpext/fptrunc with a bitcast.
SelectionDAG stopped translating these directly, and used the
bitcast + fp cast since f7a02c17628e825, so there's been no reason
to use these since 2014.
Replace patterns that manually compute allocation sizes by multiplying
getTypeAllocSize(getAllocatedType()) by the array size with calls to the
getAllocationSize(DL) API, which handles this correctly and concisely,
returning nullopt for VLAs.
This fixes several places that were not accounting for array allocations
when computing sizes, simplifies code that was doing this manually, and
adds some explicit isFixed checks where implied convert was being used.
This PR is because now that we have opaque pointers, I hate that some
AllocaInst still has type information being consumed by some passes
instead of just using the size, since passes rarely handle that type
information well or correctly. I hope this will grow into a sequence of
commits to slowly eliminate uses of getAllocatedType from AllocaInst.
And similarly later to remove type information from GlobalValue too (it
can be replaced with just dereferenceable bytes, similar to arguments).
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
As discussed on https://github.com/llvm/llvm-project/pull/144745, insert
a nop after unwinding inline assembly, as it may end on a call.
While the change itself is trivial, I ended up having to do two
infrastructure changes:
* The unwind flag needs to be propagated to ExtraInfo of the
MachineInstr.
* The MachineInstr needs to be passed through to emitInlineAsmEnd(), and
the method needs to be non-const.
Fixes https://github.com/llvm/llvm-project/issues/157073.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strstr routine is a millicode implementation;
we use millicode for the strstr function instead of a library call to
improve performance.
I add a helper function `getRuntimeCallSDValueHelper` in the patch. I
will refactor the function `SelectionDAG::getStrlen`
`SelectionDAG::getStrcpy` etc later in another patch.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strcpy routine is a millicode implementation;
we use millicode for the strcpy function instead of a library call to
improve performance.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Add support for `__builtin_stack_address` builtin. The semantics match
those of GCC's builtin with the same name.
`__builtin_stack_address` returns the starting address of the stack
region that may be used by called functions. It may or may not include
the space used for on-stack arguments passed to a callee (See [GCC
Bug/121013](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121013)).
Fixes#82632.
This PR implements the first change outlined in
https://discourse.llvm.org/t/rfc-allow-non-constant-offsets-in-llvm-vector-splice/88974?u=lukel
In order to allow non-immediate offsets in the llvm.vector.splice
intrinsic, we need to separate out the "shift left" and "shift right"
modes into two separate intrinsics, which were previously determined by
whether or not the offset is positive or negative.
The description in the LangRef has also been reworded in terms of
sliding elements left or right and extracting either the upper or lower
half as opposed to extracting from a certain index, which brings it
inline with the definition of `llvm.fshr.*`/`llvm.fshl.*`.
This patch teaches AutoUpgrade.cpp to upgrade the old intrinsics into
their new equivalent one based on their offset, so existing uses of
vector.splice should still work.
Uses of llvm.vector.splice in `llvm/test/CodeGen` haven't been replaced
in this PR to keep the diff small and kick the tyres on the AutoUpgrader
a bit. I planned to do this in a follow up NFC but can include it in
this PR if reviewers prefer.
Similarly the shuffle costing kind `SK_Splice` has just been kept the
same for now, to be split into `SK_SpliceLeft` and `SK_SpliceRight`
later.
In line with a std proposal to introduce the llvm.clmul family of
intrinsics corresponding to carry-less multiply operations. This work
builds upon 727ee7e ([APInt] Introduce carry-less multiply primitives),
and follow-up patches will introduce custom-lowering on supported
targets, replacing target-specific clmul intrinsics.
Testing is done on the RISC-V target, which should be sufficient to
prove that the intrinsics work, since no RISC-V specific lowering has
been added.
Ref: https://isocpp.org/files/papers/P3642R3.html
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Previously, we would crash in the SelectionDAGBuilder when attempting to
create debug fragments for scalable vectors split across multiple
registers.
It does not seem like DW_OP_LLVM_fragment supports any notion of
scalable type sizes. It takes both an offset and typesize as literals,
with no indication of scalability (and it also does not seem to be
considered in any of the places that handle DW_OP_LLVM_fragment). So the
workaround here is to drop the debug info.
Note: This is not usually an issue for IR that comes from the SVE ACLE,
as we generally stick to using legal types there (that don't end up
getting split).
Workaround for: #161289
The first operand should be a chain, but `GuardVal.getOperand(0)` isn't
always a chain (i.e. if `TLI.emitStackGuardXorFP()` is called). Use
`getControlRoot()` instead like in other places when creating terminator
nodes.
Extracted from #168421.
Treat these like other shift operations by allowing the shift amount to
be a different type than the result.
The PromoteIntOp_Shift and LegalizeDAG code are not tested due to lack
of target support.
I'm looking at adding SSHLSAT for the RISC-V P extension. I don't need
this support for that since RISC-V only has one legal type. I just thought it
was odd that they weren't like other shifts.
LowerCallTo() and LowerArguments() are both providing the PartOffset field for
each split argument part. As these two methods are intended to work together,
they should both provide the same offsets. However, LowerArguments() has been
providing the offset from the beginning of the struct while LowerCallTo() sets it
relative to the first split part.
This patch removes the PartBase variable in LowerArguments() so that the behavior
matches LowerCallTo(): offsets to split parts of an argument are relative to the first
part of the argument.
Both `LOOP_DEPENDENCE_WAR_MASK` and `LOOP_DEPENDENCE_RAW_MASK` are
currently hard to split correctly, and there are a number of incorrect
cases.
The difficulty comes from how the intrinsics are defined. For example,
take `LOOP_DEPENDENCE_WAR_MASK`.
It is defined as the OR of:
* `(ptrB - ptrA) <= 0`
* `elementSize * lane < (ptrB - ptrA)`
Now, if we want to split a loop dependence mask for the high half of the
mask we want to compute:
* `(ptrB - ptrA) <= 0`
* `elementSize * (lane + LoVT.getElementCount()) < (ptrB - ptrA)`
However, with the current opcode definitions, we can only modify ptrA or
ptrB, which may change the result of the first case, which should be
invariant to the lane.
This patch resolves these cases by adding a "lane offset" to the ISD
opcodes. The lane offset is always a constant. For scalable masks, it is
implicitly multiplied by vscale.
This makes splitting trivial as we increment the lane offset by
`LoVT.getElementCount()` now.
Note: In the AArch64 backend, we only support zero lane offsets (as
other cases are tricky to lower to whilewr/rw).
---------
Co-authored-by: Benjamin Maxwell <benjamin.maxwell@arm.com>
This is a followup to https://github.com/llvm/llvm-project/pull/171114,
removing the handling for most libcalls that are already canonicalized
to intrinsics in the middle-end. The only remaining one is fabs, which
has more test coverage than the others.
SDAG currently tries to lower certain libcalls to ISD opcodes. However,
many of these are already canonicalized from libcalls to intrinsic in
the middle-end (and often already emitted as intrinsics in the
front-end).
I believe that SDAG should not be doing anything for such libcalls. This
PR just drops a single libcall to get consensus on the direction, as
these changes need a non-trivial amount of test updates.
A lot of the remaining libcalls *should* probably also be canonicalized
to intrinsics in the middle-end when annotated with `memory(none)`, but
that would require additional work in SimplifyLibCalls.
This commit adds support for using intrinsics with callbr. The uses of
this will most of the time look like this example:
```llvm
callbr void @llvm.amdgcn.kill(i1 %c) to label %cont [label %kill]
kill:
unreachable
cont:
...
```
Similar to how getElementCount avoids the need to reason about fixed and
scalable ElementCounts separately, this patch adds getTypeSize to do the
same for TypeSize.
It also goes through and replaces some of the manual uses of getVScale
with getTypeSize/getElementCount where possible.
This backend support will allow the LoadStoreVectorizer, in certain
cases, to fill in gaps when creating load/store vectors and generate
LLVM masked load/stores
(https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To
accomplish this, changes are separated into two parts. This first part
has the backend lowering and TTI changes, and a follow up PR will have
the LSV generate these intrinsics:
https://github.com/llvm/llvm-project/pull/159388.
In this backend change, Masked Loads get lowered to PTX with `#pragma
"used_bytes_mask" [mask];`
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
And Masked Stores get lowered to PTX using the new sink symbol syntax
(https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st).
# TTI Changes
TTI changes are needed because NVPTX only supports masked loads/stores
with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to
check that the mask is constant and pass that result into the TTI check.
Behavior shouldn't change for non-NVPTX targets, which do not care
whether the mask is variable or constant when determining legality, but
all TTI files that implement these API need to be updated.
# Masked store lowering implementation details
If the masked stores make it to the NVPTX backend without being
scalarized, they are handled by the following:
* `NVPTXISelLowering.cpp` - Sets up a custom operation action and
handles it in lowerMSTORE. Similar handling to normal store vectors,
except we read the mask and place a sentinel register `$noreg` in each
position where the mask reads as false.
For example,
```
t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>
t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10
->
STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1);
```
* `NVPTXInstInfo.td` - changes the definition of store vectors to allow
for a mix of sink symbols and registers.
* `NVPXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_".
# Masked load lowering implementation details
Masked loads are routed to normal PTX loads, with one difference: a
`#pragma "used_bytes_mask"` is emitted before the load instruction
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
To accomplish this, a new operand is added to every NVPTXISD Load type
representing this mask.
* `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal
NVPTXISD loads with a mask operand in two ways. 1) In type legalization
through replaceLoadVector, which is the normal path, and 2) through
LowerMLOAD, to handle the legal vector types
(v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both
share the same convertMLOADToLoadWithUsedBytesMask helper. Both default
this operand to UINT32_MAX, representing all bytes on. For the latter,
we need a new `NVPTXISD::MLoadV1` type to represent that edge case
because we cannot put the used bytes mask operand on a generic
LoadSDNode.
* `NVPTXISelDAGToDAG.cpp` - Extract used bytes mask from loads, add them
to created machine instructions.
* `NVPTXInstPrinter.h/.cpp` - Print the pragma when the used bytes mask
isn't all ones.
* `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update
manual indexing of load operands to account for new operand.
* `NVPTXInsrtInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to
the MI definitions.
* `NVPTXTagInvariantLoads.cpp` - Ensure that masked loads also get
tagged as invariant.
Some generic changes that are needed:
* `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting
masked loads.
* `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked
load SDNode creation
the patch
[Add strictfp attribute to prevent unwanted optimizations of libm
calls](https://reviews.llvm.org/D34163)
add `I.isStrictFP()` into
```
if (!I.isNoBuiltin() && !I.isStrictFP() && !F->hasLocalLinkage() &&
F->hasName() && LibInfo->getLibFunc(*F, Func) &&
LibInfo->hasOptimizedCodeGen(Func))
```
it prevents the backend from optimizing even non-math libcalls such as
`strlen` and `memcmp` if a call has the strict floating-point attribute.
For example, it prevent converting strlen and memcmp to milicode call
__strlen and __memcmp.
This defends against regressions in future patches. This excludes
the target intrinsic case for now; I'm worried introducing an
intermediate
AssertNoFPClass is likely to break combines.
This intrinsic emits a BFD_RELOC_NONE relocation at the point of call,
which allows optimizations and languages to explicitly pull in symbols
from static libraries without there being any code or data that has an
effectual relocation against such a symbol.
See issue #146159 for context.
Refactor intrinsic call handling in SelectionDAGBuilder and IRTranslator
to prepare the addition of intrinsic support to the callbr instruction,
which should then share code with the handling of the normal call
instruction.
This is to follow the discussion in
https://github.com/llvm/llvm-project/pull/164565
CallBase can cover more call-like instructions which carry caling
convention flag.
Co-authored-by: Yuanke Luo <ykluo@birentech.com>