Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters, which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.
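For illustration, a hedged sketch of the intended shape (the exact
intrinsic declaration, name mangling and function names here are
assumptions, not taken from the implementation):

```ll
declare i32 @llvm.amdgcn.call.whole.wave(ptr, ...)

; The callee's first argument is the i1 original EXEC mask.
define amdgpu_gfx_whole_wave i32 @wave_fn(i1 %orig_exec, i32 %x) {
  %r = add i32 %x, 1
  ret i32 %r
}

define amdgpu_gfx i32 @caller(i32 %v) {
  ; The i1 EXEC argument is not passed at the call site.
  %r = call i32 (ptr, ...) @llvm.amdgcn.call.whole.wave(ptr @wave_fn, i32 %v)
  ret i32 %r
}
```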
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Unspeakable horrors happen around calls from whole wave functions; the
plan is to improve the handling of caller/callee-saved registers in
a future patch.
Tail calls are also handled in a future patch.
This slightly relaxes the invariant established in #149310, by also
allowing the lifetime argument to be poison. This is to support the
typical pattern of RAUWing with poison when removing an instruction.
It's worth noting that this does not require any conservative
assumptions: lifetimes with poison arguments can simply be skipped.
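A minimal sketch of the now-accepted pattern (the function name is
illustrative; the intrinsic declaration is included for completeness):

```ll
declare void @llvm.lifetime.start.p0(i64, ptr)

define void @poison_lifetime() {
  ; After RAUWing the alloca with poison, the lifetime call remains valid
  ; and can simply be skipped by analyses.
  call void @llvm.lifetime.start.p0(i64 8, ptr poison)
  ret void
}
```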
Fixes https://github.com/llvm/llvm-project/issues/151119.
Split out from https://github.com/llvm/llvm-project/pull/150248:
Specify that the argument of lifetime.start/lifetime.end is ignored and
will be removed in the future.
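For reference, a hedged example of the current form (function name
illustrative), where the leading i64 size operand is now documented as
ignored:

```ll
declare void @llvm.lifetime.start.p0(i64, ptr)
declare void @llvm.lifetime.end.p0(i64, ptr)

define void @use_lifetime() {
  %buf = alloca [64 x i8]
  ; The i64 operand (64 here) is ignored; only the alloca operand matters.
  call void @llvm.lifetime.start.p0(i64 64, ptr %buf)
  call void @llvm.lifetime.end.p0(i64 64, ptr %buf)
  ret void
}
```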
Remove lifetime size handling from SDAG. The size was previously
discarded during isel, so was always ignored for stack coloring anyway.
Where necessary, obtain the size of the full frame index.
After https://github.com/llvm/llvm-project/pull/149310 we are guaranteed
that the argument is an alloca, so we don't need to look at underlying
objects (which was not a correct thing to do anyway).
This also drops the offset argument for lifetime nodes in SDAG. The
offset is fixed to zero now. (Peculiarly, while SDAG pretended to have
an offset, it was just silently dropped during selection.)
When generating SDAG for a getelementptr with a vector result, we were
previously generating splats for each scalar operand. This essentially
has the effect of aggressively vectorizing the sequence, leaving it to
later combines to scalarize if profitable.
Instead, we can keep the accumulating address as a scalar for as long as
the prefix of operands allows before lazily converting to vector on the
first vector operand. This both better fits hardware, which frequently
has a scalar base on its scatter/gather instructions, and reduces the
addressing cost even when it doesn't, as otherwise we end up with a
scalar-to-vector domain crossing for each scalar operand.
Note that constant splat offsets are treated as scalar for the above,
and only variable offsets can force a conversion to vector.
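A hedged illustration of the kind of IR in question (function name is
illustrative): only the variable vector index needs to force the
conversion to a vector of pointers; the scalar base (and any
constant-splat offsets) can stay scalar.

```ll
define <4 x ptr> @vec_gep(ptr %base, <4 x i64> %idx) {
  %p = getelementptr i32, ptr %base, <4 x i64> %idx
  ret <4 x ptr> %p
}
```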
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Also fix the LangRef to match the implementation. This was checking
against the alloca address space size rather than the default address
space.
The check was also more permissive than the LangRef. The error
check permitted any size less than the pointer size; follow the
stricter wording of the LangRef.
Seeing as we can't generate any debug intrinsics any more, delete a
variety of codepaths where they're handled. For the most part these are
plain deletions; in others I've tweaked comments to remain coherent, or
added a type to (what were) type-generic lambdas.
This isn't all the DbgInfoIntrinsic call sites but it's most of the
simple scenarios.
Co-authored-by: Nikita Popov <github@npopov.com>
ConstantInt vectors utilise DAG.getConstant() when constructing the
initial DAG. This can have the effect of legalising the constant before
the DAG combiner is run, significantly altering the generated code. To
mitigate this (hopefully as a temporary measure) we instead try to
construct the DAG in the same way as shufflevector-based splats.
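For context, a hedged sketch of the two splat forms at the IR level
(function names and values are illustrative): the intent is for the
constant-vector form to reach the DAG combiner looking like the
shufflevector-based one rather than being legalised up front.

```ll
define <4 x i32> @const_splat() {
  ret <4 x i32> <i32 7, i32 7, i32 7, i32 7>
}

define <4 x i32> @shuffle_splat() {
  %ins = insertelement <4 x i32> poison, i32 7, i64 0
  %s = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
  ret <4 x i32> %s
}
```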
This patch optimizes the Windows security cookie check mechanism by
moving the comparison inline and only calling __security_check_cookie
when the check fails. This reduces the overhead of making a DLL call
for every function return.
Previously, we implemented this optimization through a machine pass
(X86WinFixupBufferSecurityCheckPass) in PR #95904 submitted by
@mahesh-attarde. We have reverted that pass in favor of this new
approach. Also, we have abandoned the AArch64-specific implementation
of the same pass in PR #121938 in favor of this more general solution.
The old machine instruction pass approach:
- Scanned the generated code to find __security_check_cookie calls
- Modified these calls by splitting basic blocks
- Added comparison logic and conditional branching
- Required complex block management and live register computation
The new approach:
- Implements the same optimization during instruction selection
- Directly emits the comparison and conditional branching
- No need for post-processing or basic block manipulation
- Disables optimization at -Oz.
Thanks @tamaspetz, @efriedma-quic and @arsenm for their help.
In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or
more labels / basic blocks in the containing function which the assembly
code might jump to. If you're also compiling with branch target
enforcement via BTI, then previously listing a basic block as a possible
jump destination of an asm goto would cause a BTI instruction to be
placed at the start of the block, in case the assembly code used an
_indirect_ branch instruction (i.e. to a destination address read from a
register) to jump to that location. Now it doesn't do that any more:
branches to destination labels from the assembly code are assumed to be
direct branches (to a relative offset encoded in the instruction), which
don't require a BTI at their destination.
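A hedged IR-level sketch of the construct in question (the asm body and
label names are illustrative):

```ll
define void @f() {
entry:
  ; 'asm goto' becomes a callbr; %indirect_label is a possible asm destination.
  ; With this change, no BTI is required at %indirect_label, since branches
  ; from the assembly to it are assumed to be direct.
  callbr void asm sideeffect "nop", "!i"()
          to label %fallthrough [label %indirect_label]
fallthrough:
  ret void
indirect_label:
  ret void
}
```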
This change was proposed in https://discourse.llvm.org/t/85845 and there
seemed to be no disagreement. The rationale is:
1. it brings clang's handling of asm goto in Arm and AArch64 in line
with gcc's, which didn't generate BTIs at the target labels in the first
place.
2. it improves performance in the Linux kernel, which uses a lot of 'asm
goto' in which the assembly language just contains a NOP, and the
label's address is saved elsewhere to let the kernel self-modify at run
time to swap between the original NOP and a direct branch to the label.
This allows hot code paths to be instrumented for debugging, at only the
cost of a NOP when the instrumentation is turned off, instead of the
larger cost of an indirect branch. In this situation a BTI is
unnecessary (if the branch happens, it's direct) and, since the code
paths are hot, it is also a noticeable performance hit.
Implementation:
`SelectionDAGBuilder::visitCallBr` is the place where 'asm goto' target
labels are handled. It calls `setIsInlineAsmBrIndirectTarget()` on each
target `MachineBasicBlock`. Previously it also called
`setMachineBlockAddressTaken()`, which made `hasAddressTaken()` return
true, which caused a BTI to be added in the Arm backends.
Now `visitCallBr` doesn't call `setMachineBlockAddressTaken()` any more
on asm goto targets, but `hasAddressTaken()` also checks the flag set by
`setIsInlineAsmBrIndirectTarget()`. So call sites that were using
`hasAddressTaken()` don't need to be modified. But the Arm backends
don't call `hasAddressTaken()` any more: instead they test two more
specific query functions that cover all the reasons `hasAddressTaken()`
might have returned true _except_ being an asm goto target.
Testing:
The new test `AArch64/callbr-asm-label-bti.ll` is testing the actual
change, where it expects not to see a `bti` instruction after
`[[LABEL]]`. The rest of the test changes are all churn, due to the
flags on basic blocks changing. Actual output code hasn't changed in any
of the existing tests, only comments and diagnostics.
Further work:
`RISCVIndirectBranchTracking.cpp` and `X86IndirectBranchTracking.cpp`
also call `hasAddressTaken()` in a way that might benefit from using the
same more specific check I've put in `ARMBranchTargets.cpp` and
`AArch64BranchTargets.cpp`. But I'm not sure of that, so in this commit
I've only changed the Arm backends, and left those alone.
In https://reviews.llvm.org/D157685 I changed SDAG to only transfer
!range metadata if it is also accompanied by !noundef. At the time, this was
necessary because SDAG incorrectly propagated poison when folding
logical and/or to bitwise and/or.
The root cause of that issue has since been addressed by
https://github.com/llvm/llvm-project/pull/84924, so drop the workaround
now.
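A hedged example of IR that benefits (function name and range values are
illustrative): the !range here can now be transferred to SDAG even
without an accompanying !noundef.

```ll
define i8 @load_bool(ptr %p) {
  %v = load i8, ptr %p, align 1, !range !0
  ret i8 %v
}

!0 = !{i8 0, i8 2}
```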
This opcode represents the addition of a pointer value (first operand)
and an integer offset (second operand). PTRADD nodes are only generated
if the TargetMachine opts in by overriding
TargetMachine::shouldPreservePtrArith().
The PTRADD node and respective visitPTRADD() function were adapted by
@rgwott from the CHERI/Morello LLVM tree.
Original authors: @davidchisnall, @jrtc27, @arichardson.
The changes in this PR were extracted from PR #105669.
---------
Co-authored-by: David Chisnall <github@theravensnest.org>
Co-authored-by: Jessica Clarke <jrtc27@jrtc27.com>
Co-authored-by: Alexander Richardson <alexrichardson@google.com>
Co-authored-by: Rodolfo Wottrich <rodolfo.wottrich@arm.com>
This adds [de]interleave intrinsics for factors of 4,6,8, so that every
interleaved memory operation supported by the in-tree targets can be
represented by a single intrinsic.
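A hedged sketch of one of the new intrinsics, assuming the naming and
mangling follow the existing llvm.vector.interleave2 (the function name
is illustrative):

```ll
declare <vscale x 16 x i32> @llvm.vector.interleave4.nxv16i32(
    <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>)

define <vscale x 16 x i32> @interleave4(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
                                        <vscale x 4 x i32> %c, <vscale x 4 x i32> %d) {
  ; A single factor-4 interleave instead of a tree of interleave2 calls.
  %v = call <vscale x 16 x i32> @llvm.vector.interleave4.nxv16i32(
          <vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
          <vscale x 4 x i32> %c, <vscale x 4 x i32> %d)
  ret <vscale x 16 x i32> %v
}
```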
For context, [de]interleaves of fixed-length vectors are represented by
a series of shufflevectors. The intrinsics are needed for scalable
vectors, and we don't currently scalably vectorize all possible factors
of interleave groups supported by RISC-V/AArch64.
The underlying reason for this is that higher factors are currently
represented by interleaving multiple interleaves themselves, which made
sense at the time in the discussion in
https://github.com/llvm/llvm-project/pull/89018.
But after trying to integrate these for higher factors on RISC-V I think
we should revisit this design choice:
- Matching these in InterleavedAccessPass is non-trivial: We currently
only support factors that are a power of 2, and detecting this requires
a good chunk of code
- The shufflevector masks used for [de]interleaves of fixed-length
vectors are much easier to pattern match as they are strided patterns,
but for the intrinsics it's much more complicated to match as the
structure is a tree.
- Unlike shufflevectors, there's no optimisation that happens on
[de]interleave2 intrinsics
- For non-power-of-2 factors e.g. 6, there are multiple possible ways a
[de]interleave could be represented, see the discussion in #139373
- We already have intrinsics for 2,3,5 and 7, so by avoiding 4,6 and 8
we're not really saving much
By representing these higher factors as interleaved interleaves, we can
in theory support arbitrarily high interleave factors. However, I'm not
sure this is actually needed in practice: SVE only has instructions
for factors 2,3,4, whilst RVV only supports up to factor 8.
This patch would make it much easier to support scalable interleaved
accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as
the loop vectorizer and InterleavedAccessPass wouldn't need to
construct and match trees of interleaves.
For interleave factors above 8, for which there are no hardware memory
operations to match in the InterleavedAccessPass, we can still keep the
wide load + recursive interleaving in the loop vectorizer.
Of the 128 bits of a buffer descriptor, only 48 are address bits, so
following the discussion on https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54,
the logical conclusion is to set the index width to 48 bits instead of
the current value of 128.
Most of the test changes are mechanical datalayout updates, but there
is one actual change: the ptrmask test now uses .i48 instead of .i128
and I had to update SelectionDAGBuilder to correctly extend the mask.
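A hedged, minimal sketch of what this means at the IR level (the address
space number and the stripped-down datalayout string are assumptions for
illustration, not the real AMDGPU datalayout):

```ll
target datalayout = "p8:128:128:128:48"

declare ptr addrspace(8) @llvm.ptrmask.p8.i48(ptr addrspace(8), i48)

define ptr addrspace(8) @mask_rsrc(ptr addrspace(8) %rsrc, i48 %mask) {
  ; With a 48-bit index width, the ptrmask mask is i48 rather than i128.
  %m = call ptr addrspace(8) @llvm.ptrmask.p8.i48(ptr addrspace(8) %rsrc, i48 %mask)
  ret ptr addrspace(8) %m
}
```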
Reviewed By: krzysz00
Pull Request: https://github.com/llvm/llvm-project/pull/139419
For now, expansion still happens in SelectionDAGBuilder when
GET_ACTIVE_LANE_MASK is not legal on the target.
This patch also includes changes in AArch64ISelLowering to replace
the handling of the get.active.lane.mask intrinsic with the ISD node.
Tablegen patterns are added which match to whilelo for scalable types.
A follow up change will add support for more types to be lowered to
GET_ACTIVE_LANE_MASK by allowing splitting of the node.
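For reference, the intrinsic that now lowers through the new ISD node
(function name illustrative); the scalable form below is the kind that
matches to whilelo on AArch64:

```ll
declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)

define <vscale x 4 x i1> @lane_mask(i64 %base, i64 %n) {
  ; Element i of the mask is active when %base + i < %n.
  %m = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %base, i64 %n)
  ret <vscale x 4 x i1> %m
}
```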
It is used to mark a value that we are sure is not of some fcType.
Examples include:
* An argument of a function marked with nofpclass
* The output value of an intrinsic known not to be of some type
Subsequent operations can then make assumptions based on this.
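A hedged example of the first case (an argument marked nofpclass), and
the kind of assumption it enables; the function name is illustrative:

```ll
define float @no_nan(float nofpclass(nan) %x) {
  ; %x is known not to be a NaN, so this unordered compare can fold to false.
  %isnan = fcmp uno float %x, %x
  %r = select i1 %isnan, float 0.0, float %x
  ret float %r
}
```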
Migrate their usage to the `AnyMem*Inst` family, and add an isAtomic()
query on the base class for that hierarchy. This matches the idioms we
use for e.g. isAtomic on load, store, etc. instructions, the existing
isVolatile idioms on mem* routines, and allows us to more easily share
code between atomic and non-atomic variants.
As with #138568, the goal here is to simplify the class hierarchy and
make it easier to reason about. I'm moving from easiest to hardest, and
will stop at some point when I hit "good enough". Longer term, I'd sorta
like to merge or reverse the naming on the plain Mem*Inst and the
AnyMem*Inst, but that's a much larger and more risky change. Not sure
I'm going to actually do that.
This is a follow up to c0a264e, but note that there is a functional
difference here: the root changes for the memcpy.inline case. This
difference appears to have been accidental, but I kept this back to
facilitate separate review in case there's something I'm missing here.
This is a reland of #138434 except that:
- the bits for llvm/lib/CodeGen/RenameIndependentSubregs.cpp
have been dropped because they caused a test failure under asan, and
- the bits for llvm/lib/CodeGen/SelectionDAG/ScheduleDAGFast.cpp have
been improved with structured bindings.
This reverts commit a9699a334bc9666570418a3bed9520bcdc21518b.
Breaks CodeGen/AMDGPU/collapse-endcf.ll in several configs
(sanitizer builds; macOS; possibly more), see comments on
https://github.com/llvm/llvm-project/pull/138434
This patch adds support for LLVM IR atomicrmw `fmaximum` and `fminimum`
instructions.
These mirror the `llvm.maximum.*` and `llvm.minimum.*` instructions, but
are atomic and use IEEE 754-2019 handling for NaNs, which is different to
`fmax` and `fmin`. See:
https://llvm.org/docs/LangRef.html#llvm-minimum-intrinsic
for more details.
Future changes will allow this LLVM IR to be lowered to specialised
assembler instructions on suitable targets, such as AArch64.
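A hedged example of the new operations in IR (function name
illustrative):

```ll
define float @atomic_fmax(ptr %p, float %v) {
  ; Atomically computes the IEEE 754-2019 maximum of *%p and %v,
  ; returning the old value.
  %old = atomicrmw fmaximum ptr %p, float %v seq_cst
  ret float %old
}
```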
And teach SelectionDAGBuilder to get the range metadata in
visitAtomicLoad.
This allows us to recognize that sign extending a byte load of a
boolean value from memory will produce zeros for the extended bits.
This allows us to remove an AND on RISC-V.
Tests copied from #136502 with range metadata added to i1 cases.
Some of the test effects overlap with #136502, but that patch can't
handle the acquire or seq_cst cases with the Zalasr extension. We
only have sign extending versions of those loads.
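As an illustration (hedged, mirroring the shape of those tests; names
and range values are illustrative): an atomic boolean load whose !range
metadata lets the later sign extension be folded away.

```ll
define i64 @load_bool_sext(ptr %p) {
  %b = load atomic i8, ptr %p acquire, align 1, !range !0
  %s = sext i8 %b to i64   ; extended bits are known zero given the range [0, 2)
  ret i64 %s
}

!0 = !{i8 0, i8 2}
```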
Rename one signature of getAtomic to getAtomicLoad and pass LoadExtType.
Previously we had to set the extension type after the node was created,
but we don't usually modify SDNodes once they are created. It's possible
the node already existed and has been CSEd. If that happens, modifying
the node may affect the other users. It's therefore safer to add the
extension type at creation so that it is part of the CSE information.
I don't know of any failures related to the current implementation. I
only noticed that it doesn't match how we usually do things.
This change removes the uint64_t constructor on LocationSize
preventing implicit conversion, and fixes up the using APIs to adapt to
the change. Note that I'm adding a couple of explicit conversion points
on routines where passing in a fixed offset as an integer seems likely
to have well understood semantics.
We had an unfortunate case which arose if you tried to pass a TypeSize
value to a parameter of LocationSize type. We'd find the implicit
conversion path through TypeSize -> uint64_t -> LocationSize which works
just fine for fixed values, but loses information and fails assertions
if the TypeSize was scalable. This change breaks the first link in that
implicit conversion chain since that seemed to be the easier one.
In [1], Nikita Popov suggested that during lowering an 'unreachable' insn
should not generate extra code for naked functions, and this applies to
all architectures. Note that for naked functions, the 'unreachable' insn
is still necessary in IR since the basic block needs a terminator.
This patch checks whether a function is a naked function; if it is, the
'unreachable' insn will not generate ISD::TRAP.
[1] https://github.com/llvm/llvm-project/pull/131731
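A hedged sketch of the pattern in question (the inline asm body and the
function name are illustrative):

```ll
define void @naked_fn() naked {
  call void asm sideeffect "ret", ""()
  ; The terminator is still required in IR, but with this change it no
  ; longer lowers to an ISD::TRAP for naked functions.
  unreachable
}
```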
Co-authored-by: Yonghong Song <yonghong.song@linux.dev>
The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may
indicate that we want to reallocate the VGPRs before performing the
call.
A call with the following arguments:
```
llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args,
/*flags*/0x1, %num_vgprs, %fallback_exec, %fallback_callee
```
is supposed to do the following:
- copy the SGPR and VGPR args into their respective registers
- try to change the VGPR allocation
- if the allocation has succeeded, set EXEC to %exec and jump to
%callee, otherwise set EXEC to %fallback_exec and jump to
%fallback_callee
This patch implements the dynamic VGPR behaviour by generating an
S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and
callee. The rest of the call sequence is left undisturbed (i.e.
identical to the case where the flags are 0 and we don't use dynamic
VGPRs). We achieve this by introducing some new pseudos
(SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering
pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason
is so that we don't risk other passes (particularly the PostRA
scheduler) introducing instructions between the S_ALLOC_VGPR and the
jump. Such instructions might end up using VGPRs that have been
deallocated, or the wrong EXEC mask. Once the whole backend treats
S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use
VGPRs, we could in principle move the expansion earlier (but in the
absence of a good reason for that my personal preference is to keep it
later in order to make debugging easier).
Since the expansion happens after register allocation, we're careful to
select constants to immediate operands instead of letting ISel generate
S_MOVs which could interfere with register allocation (i.e. make it look
like we need more registers than we actually do).
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during
ISel in wave64 mode. However, we can define the pseudos for wave64 too
so it's easy to handle if future generations support it.
---------
Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Unlike in Itanium EH IR, WinEH IR's unwinding instructions (e.g.
`invoke`s) can have multiple possible unwind destinations.
For example:
```ll
entry:
invoke void @foo()
to label %cont unwind label %catch.dispatch
catch.dispatch: ; preds = %entry
%0 = catchswitch within none [label %catch.start] unwind label %terminate
catch.start: ; preds = %catch.dispatch
%1 = catchpad within %0 [ptr null]
...
terminate: ; preds = %catch.dispatch
%2 = catchpad within none []
...
...
```
In this case, if an exception is not caught by `catch.dispatch` (and
thus `catch.start`), it should next unwind to `terminate`.
`findUnwindDestination` in ISel gathers the list of these unwind
destinations by traversing the unwind edges:
ae42f07103/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (L2089-L2150)
But we don't use that, and instead use our custom
`findWasmUnwindDestinations` that only adds the first unwind
destination, `catch.start`, to the successor list of `entry`, and not
`terminate`:
ae42f07103/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (L2037-L2087)
The reason behind this, as described in the comment block in the code,
was the assumption that there would always be an `invoke` that connects
`catch.start` and `terminate`. In case of `catch (type)`, there will be
`call void @llvm.wasm.rethrow()` in `catch.start`'s predecessor that
unwinds to the next destination. For example:
0db702ac8e/llvm/test/CodeGen/WebAssembly/exception.ll (L429-L430)
In case of `catch (...)`, `__cxa_end_catch` can throw, so it becomes an
`invoke` that unwinds to the next destination. For example:
0db702ac8e/llvm/test/CodeGen/WebAssembly/exception.ll (L537-L538)
So the unwind ordering relationship between `catch.start` and
`terminate` here would be preserved.
But it turns out this assumption does not always hold. For example:
```ll
entry:
invoke void @foo()
to label %cont unwind label %catch.dispatch
catch.dispatch: ; preds = %entry
%0 = catchswitch within none [label %catch.start] unwind label %terminate
catch.start: ; preds = %catch.dispatch
%1 = catchpad within %0 [ptr null]
...
call void @_ZSt9terminatev()
unreachable
terminate: ; preds = %catch.dispatch
%2 = catchpad within none []
call void @_ZSt9terminatev()
unreachable
...
```
In this case there is no `invoke` that connects `catch.start` to
`terminate`. So after `catch.dispatch` BB is removed in ISel,
`terminate` is considered unreachable and incorrectly removed in DCE.
This makes Wasm just use the general `findUnwindDestination`. In that
case `entry`'s successor is going to be [`catch.start`, `terminate`]. We
can get the first unwind destination by just traversing the list from
the front.
---
This required another change in WinEHPrepare. WinEHPrepare demotes all
PHIs in EH pads because they are funclets in Windows and funclets can't
have PHIs. When used in Wasm they are not funclets, so we don't need to
do that wholesale, but we still need to demote PHIs in `catchswitch` BBs
because they are deleted during ISel. (So we created the
[`-demote-catchswitch-only`](a5588b6d20/llvm/lib/CodeGen/WinEHPrepare.cpp (L57-L59))
option for that.)
But it turns out we need to remove PHIs that have a `catchswitch` BB as an
incoming block too:
```ll
...
catch.dispatch:
%0 = catchswitch within none [label %catch.start] unwind label %terminate
catch.start:
...
somebb:
...
ehcleanup:       ; preds = %catch.dispatch, %somebb
%1 = phi i32 [ 10, %catch.dispatch ], [ 20, %somebb ]
...
```
In this case the `phi` in `ehcleanup` BB should be demoted too because
`catch.dispatch` BB will be removed in ISel, so one of its incoming blocks
will be gone. This pattern didn't manifest before, presumably due to how
`findWasmUnwindDestinations` worked. (In this example, in our
`findWasmUnwindDestinations`, `catch.dispatch` would have had only one
successor, `catch.start`. But now `catch.dispatch` has both
`catch.start` and `ehcleanup` as successors, revealing this bug.) This
case is
[represented](ab87206c4b/llvm/test/CodeGen/WebAssembly/exception.ll (L445))
by the `rethrow_terminator` function in `exception.ll` (or
`exception-legacy.ll`), and without the WinEHPrepare fix it will crash.
---
Discovered by the reproducer provided in #126916, even though the bug
reported there was not this one.