This is to follow the discussion in
https://github.com/llvm/llvm-project/pull/164565
CallBase covers more call-like instructions that carry a calling
convention flag.
Co-authored-by: Yuanke Luo <ykluo@birentech.com>
Selection DAG has a more sophisticated execution order representation
than the simple sequence used in IR, so building the DAG can take into
account specific properties of the nodes to better express possible
parallelism. The existing implementation does this for constrained
function calls: some of them are treated as independent, which can
potentially improve the generated code. However, this mechanism
incorrectly assumes that calls with exception behavior 'ebIgnore'
cannot raise floating-point exceptions. The purpose of this change is to
fix the implementation.
In the current implementation, constrained function calls don't
immediately update the DAG root. Instead, the DAG builder collects their
output chains and flushes them when the root is required. Constrained
function calls cannot be moved across calls of external functions and
intrinsics that access the floating-point environment; these work as
barriers. Between the barriers, constrained function calls can be
reordered, because they may be considered independent from the viewpoint
of raising exceptions. For strictfp functions this is possible only if
floating-point trapping is disabled.
This change introduces a new restriction: calls with default
exception handling cannot be moved between strictfp function calls.
Otherwise the exceptions raised by such a call can disturb the expected
exception sequence. It means that constrained function calls with strict
exception behavior act as barriers for the calls with non-strict
behavior and vice versa. Effectively, the entire sequence of constrained
calls in IR is split into "strict" and "non-strict" regions. Within a
region, restrictions on the order of constrained calls are relaxed, but
moving a call from one region to another is not allowed. This agrees
with the representation of strictfp code in high-level languages. For
example, C/C++ strictfp code corresponds to blocks where pragma `STDC
FENV_ACCESS ON` is in effect, so this restriction should help preserve
the intended semantics.
When floating-point exception trapping is enabled, constrained
intrinsics with 'ebStrict' cannot be reordered; their sequence must be
identical to the original source order. The current implementation does
not distinguish between strictfp modes with trapping and without it.
This change makes the assumption that trapping is disabled. This is not
correct in the general case, but is compatible with the existing
implementation.
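The region rule described above can be sketched as a small standalone model. This is only an illustration under stated assumptions: `Kind`, `assignRegions` and `mayReorder` are hypothetical names that do not exist in LLVM, and the model assumes trapping is disabled, as the change itself does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical classification of instructions in a strictfp function.
enum class Kind {
  Strict,    // constrained call with strict exception behavior
  NonStrict, // constrained call with non-strict exception behavior
  Barrier    // external call or intrinsic that accesses the FP environment
};

// Assign a region id to every instruction. The id changes at each barrier
// and whenever the sequence switches between strict and non-strict calls.
// Barriers get -1 because they never take part in reordering.
std::vector<int> assignRegions(const std::vector<Kind> &Seq) {
  std::vector<int> Region(Seq.size(), -1);
  int Current = 0;
  bool HavePrev = false;
  Kind Prev = Kind::Barrier;
  for (std::size_t I = 0; I < Seq.size(); ++I) {
    if (Seq[I] == Kind::Barrier) {
      ++Current; // a barrier closes the current region
      HavePrev = false;
      continue;
    }
    if (HavePrev && Seq[I] != Prev)
      ++Current; // strict <-> non-strict switch starts a new region
    Region[I] = Current;
    Prev = Seq[I];
    HavePrev = true;
  }
  return Region;
}

// Two constrained calls may be reordered only inside the same region.
bool mayReorder(const std::vector<int> &Region, std::size_t A, std::size_t B) {
  return Region[A] >= 0 && Region[A] == Region[B];
}
```

In this model, two adjacent non-strict calls share a region and may be swapped, while a non-strict call followed by a strict call falls into different regions and keeps its order.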
This patch introduces SDNodeFlags::InBounds, to show that an ISD::PTRADD SDNode
implements an inbounds getelementptr operation (i.e., the pointer operand is in
bounds wrt. an allocated object it is based on, and the arithmetic does not
change that). The flag is set in the DAG construction when lowering inbounds
GEPs.
Inbounds information is useful in the ISel when selecting memory instructions
that perform address computations whose intermediate steps must be in the same
memory region as the final result. Follow-up patches to propagate the flag in
DAGCombines and to use it when lowering AMDGPU's flat memory instructions,
where the immediate offset must not affect the memory aperture of the address
(similar to this GISel patch: #153001), are planned.
This mirrors #150900, which has introduced a similar flag in GlobalISel.
This patch supersedes #131862, which previously attempted to introduce an
SDNodeFlags::InBounds flag. The difference between this PR and #131862 is that
there is now an ISD::PTRADD opcode (PR #140017) and the InBounds flag is only
defined to apply to ISD::PTRADD DAG nodes. It is therefore unambiguous that
in-bounds-ness refers to a memory object into which the left operand of the
PTRADD node points (in contrast to #131862, where InBounds would have applied
to commutative ISD::ADD nodes, so that the semantics would be more difficult to
reason about).
For SWDEV-516125.
When switching from fast isel to DAG isel, the input value comes from an
LLVM IR instruction. If the instruction is a call, we should get the
calling convention of the callee and pass it to
RegsForValue::getCopyFromRegs, so that it can deduce the right
RegisterVT of the returned value of the callee.
---------
Co-authored-by: Yuanke Luo <ykluo@birentech.com>
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.
This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)
It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
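The change in the zero-alignment special case for gather/scatter can be summarised with a small standalone model. It is a sketch only: `AlignAttr`, `oldEffectiveAlign`, `newEffectiveAlign` and `migrate` are hypothetical names for illustration, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an optional `align` attribute on the pointer
// (or vector of pointers) argument.
struct AlignAttr {
  bool Present;
  std::uint64_t Value; // only meaningful when Present is true
};

// Old masked.gather / masked.scatter convention: a separate alignment
// immarg, where 0 meant "ABI alignment of the element type".
std::uint64_t oldEffectiveAlign(std::uint64_t ImmArg, std::uint64_t AbiAlign) {
  return ImmArg == 0 ? AbiAlign : ImmArg;
}

// New convention: an omitted `align` attribute implies alignment 1.
std::uint64_t newEffectiveAlign(AlignAttr Attr) {
  return Attr.Present ? Attr.Value : 1;
}

// Attribute to emit when migrating an old call so that the effective
// alignment stays the same: the ABI alignment must be made explicit.
AlignAttr migrate(std::uint64_t ImmArg, std::uint64_t AbiAlign) {
  return AlignAttr{true, oldEffectiveAlign(ImmArg, AbiAlign)};
}
```

For instance, an old gather with immarg 0 and a 4-byte ABI alignment has to carry an explicit `align 4` after migration. Dropping the attribute would lower the implied alignment to 1.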
Instead of just deferring to ptrtoint, we should truncate to the index
width and then perform the ZextOrTrunc.
This is effectively NFC since ptrtoint ends up doing the same thing, but
handling it explicitly is cleaner and will make it easier to eventually
upstream the changes needed for CHERI support.
Reviewed By: nikic, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/139423
Reland #161355, after fixing up the cross-projects-tests for the wasm
simd intrinsics.
Original commit message:
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strlen routine is a millicode implementation;
we use millicode for the strlen function instead of a library call to
improve performance.
The PowerPC changes are caused by shifts created by different IR
operations being CSEd now. This allows consecutive loads to be turned
into vectors earlier. This has effects on the ordering of other combines
and legalizations. This leads to some improvements and some regressions.
Support tail calls to whole wave functions (trivial) and from whole wave
functions (slightly more involved because we need a new pseudo for the
tail call return that patches up the EXEC mask).
Move the expansion of whole wave function return pseudos (regular and
tail call returns) to prolog epilog insertion, since that's where we
patch up the EXEC mask.
It can be unsafe to load a vector from an address and write a vector to
an address if those two addresses have overlapping lanes within a
vectorised loop iteration.
This PR adds intrinsics designed to create a mask with lanes disabled if
they overlap between the two pointer arguments, so that only safe lanes
are loaded, operated on and stored. The `loop.dependence.war.mask`
intrinsic represents cases where the store occurs after the load, and
the opposite for `loop.dependence.raw.mask`. The distinction between
write-after-read and read-after-write is important, since the ordering
of the read and write operations affects whether the chain of those
instructions can be done safely.
Along with the two pointer parameters, the intrinsics also take an
immediate that represents the size in bytes of the vector element types.
This will be used by #100579.
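To make the write-after-read case concrete, here is a simplified standalone model of which lanes stay enabled. It is derived from the hazard reasoning above, not copied from the intrinsic's definition, and it assumes the two addresses differ by a whole multiple of the element size. `warMask` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model: each vector iteration loads NumLanes elements starting
// at ReadAddr, then stores NumLanes elements starting at WriteAddr.
// A lane is kept active if no earlier lane's store would, in the original
// scalar order, have overwritten the element this lane loads.
std::vector<bool> warMask(std::int64_t ReadAddr, std::int64_t WriteAddr,
                          std::int64_t ElementSize, std::size_t NumLanes) {
  std::int64_t Diff = WriteAddr - ReadAddr;
  std::vector<bool> Mask(NumLanes);
  for (std::size_t Lane = 0; Lane < NumLanes; ++Lane) {
    // Diff <= 0: stores land at or below the loads, so every load already
    // precedes the conflicting store in scalar order.
    // Otherwise only lanes that lie strictly below the store base are safe.
    Mask[Lane] =
        Diff <= 0 || static_cast<std::int64_t>(Lane) * ElementSize < Diff;
  }
  return Mask;
}
```

With 4-byte elements and a store base 8 bytes above the load base, only the first two lanes remain active in this model. With the store base below the load base, all lanes are active.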
We just replaced SmallSet<T *, N> with SmallPtrSet<T *, N>, bypassing
the redirection found in SmallSet.h. With that, we no longer need to
include SmallSet.h in many files.
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
    template <typename PointeeType, unsigned N>
    class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
    {};
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
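The redirection pattern can be reproduced in isolation. The sketch below uses hypothetical stand-ins (`MiniSet`, `MiniPtrSet`) backed by `std::set`, not the real LLVM containers. It only shows how the partial specialization makes the pointer case inherit from the pointer set.

```cpp
#include <cassert>
#include <set>
#include <type_traits>

// Hypothetical stand-in for SmallPtrSet.
template <typename PtrType, unsigned N> class MiniPtrSet {
  std::set<PtrType> Storage;

public:
  bool insert(PtrType P) { return Storage.insert(P).second; }
  bool contains(PtrType P) const { return Storage.count(P) != 0; }
};

// Hypothetical stand-in for the generic SmallSet.
template <typename T, unsigned N> class MiniSet {
  std::set<T> Storage;

public:
  bool insert(const T &V) { return Storage.insert(V).second; }
  bool contains(const T &V) const { return Storage.count(V) != 0; }
};

// The "redirection": for pointer element types, the set is just the
// pointer set under a different name.
template <typename PointeeType, unsigned N>
class MiniSet<PointeeType *, N> : public MiniPtrSet<PointeeType *, N> {};

static_assert(
    std::is_base_of<MiniPtrSet<int *, 4>, MiniSet<int *, 4>>::value,
    "pointer element types are redirected to the pointer set");
```

Since `MiniSet<int *, 4>` adds nothing on top of `MiniPtrSet<int *, 4>`, spelling the pointer set directly says the same thing with one less indirection.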
MIPS requires fp128 args/returns to be passed differently than i128. It
handles this by inspecting the pre-legalization type. However, for soft
float libcalls, the original type is currently not provided (it will
look like an i128 call). To work around that, MIPS maintains a list of
libcalls working on fp128.
This patch removes that list by providing the original, pre-softening
type to calling convention lowering. This is done by carrying additional
information in CallLoweringInfo, as we unfortunately do need both types
(we want the un-softened type for OrigTy, but we need the softened type
for the actual register assignment etc.)
This is in preparation for completely removing all the custom
pre-analysis code in the Mips backend and replacing it with use of
OrigTy.
This reverts commit 14cd1339318b16e08c1363ec6896bd7d1e4ae281. The
buildbot failure seems to have been a cmake issue which has been
discussed in more detail in this Discourse post:
https://discourse.llvm.org/t/cmake-doesnt-regenerate-all-tablegen-target-files/87901
If any buildbots fail to select arbitrary intrinsics with this patch,
it's worth considering using clean builds with ccache instead of
incremental builds, as recommended here:
https://llvm.org/docs/HowToAddABuilder.html#:~:text=Use%20CCache%20and%20NOT%20incremental%20builds
The original commit message for this patch:
Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Tail calls are handled in a future patch.
https://github.com/llvm/llvm-project/pull/152709 exposed the original IR
argument type to the CC lowering logic. However, in SDAG, this used the
raw type, prior to aggregate splitting. This PR changes it to use the
non-aggregate type instead. (This matches what happened in the
GlobalISel case already.)
I've also added some more detailed documentation on the
InputArg/OutputArg fields, to explain how they differ. In most cases
ArgVT is going to be the EVT of OrigTy, so they encode very similar
information (OrigTy just preserves some additional information lost in
EVTs, like pointer types). One case where they do differ is in
post-legalization lowering of libcalls, where ArgVT is going to be a
legalized type, while OrigTy is going to be the original non-legalized
type.
Partially fix #149023.
The original code `MRI.def_begin(Reg)->getParent()` may return the
incorrect MI, as the physical register `Reg` may have multiple
definitions.
This patch selects the correct MI to verify by comparing the MBB of each
definition.
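The difference between taking the first definition and picking one by block can be illustrated with a toy model. Everything here is hypothetical (`Def`, `firstDef`, `findDefInBlock`), and it assumes that the block to compare against is already known. The actual patch works on MachineInstr and MachineBasicBlock.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one definition of a physical register.
struct Def {
  int InstrId; // stands in for the defining instruction
  int BlockId; // stands in for the parent basic block
};

// Old behavior in this model: take whatever definition comes first,
// which may be in an unrelated block.
int firstDef(const std::vector<Def> &Defs) {
  return Defs.empty() ? -1 : Defs.front().InstrId;
}

// New behavior in this model: walk all definitions and return the one
// whose block matches the block of interest, or -1 if there is none.
int findDefInBlock(const std::vector<Def> &Defs, int BlockId) {
  for (const Def &D : Defs)
    if (D.BlockId == BlockId)
      return D.InstrId;
  return -1;
}
```

With definitions in blocks 0, 1 and 2, a query for block 1 returns the second definition, while `firstDef` would always return the one in block 0.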
New testcase hangs with -O1/2/3 enabled. The BranchFolding may be to
blame.
It is common to have ABI requirements for illegal types: For example,
two i64 argument parts that originally came from an fp128 argument may
have a different call ABI than ones that came from an i128 argument.
The current calling convention lowering does not provide access to this
information, so backends come up with various hacks to support it (like
additional pre-analysis cached in CCState, or bypassing the default
logic entirely).
This PR adds the original IR type to InputArg/OutputArg and passes it
down to CCAssignFn. It is not actually used anywhere yet; this just does
the mechanical changes to thread through the new argument.
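As a rough illustration of why the original type matters, consider the toy assignment function below. The ABI rule is invented for the example, and `OrigType`, `Location`, `ArgPart` and `assignPart` are hypothetical names, not the real CCAssignFn interface.

```cpp
#include <cassert>

// Hypothetical stand-ins for the original IR type of an argument and for
// the location a calling convention assigns to one legalized part.
enum class OrigType { I128, FP128 };
enum class Location { IntegerRegister, Stack };

// One legalized i64 part together with the IR type it was split from.
struct ArgPart {
  unsigned Bits;
  OrigType Orig;
};

// Invented ABI rule, for illustration only: parts that came from an fp128
// go on the stack, parts that came from an i128 go in integer registers.
// Without Orig, both kinds of part would look like the same i64.
Location assignPart(const ArgPart &Part) {
  return Part.Orig == OrigType::FP128 ? Location::Stack
                                      : Location::IntegerRegister;
}
```

Both parts are 64 bits wide after legalization, so the only thing that distinguishes them in this model is the original type carried alongside.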
This introduces a new `ptrtoaddr` instruction which is similar to
`ptrtoint` but has two differences:
1) Unlike `ptrtoint`, `ptrtoaddr` does not capture provenance
2) `ptrtoaddr` only extracts (and then extends/truncates) the low
index-width bits of the pointer
For most architectures, difference 2) does not matter since index (address)
width and pointer representation width are the same, but this does make a
difference for architectures that have pointers that aren't just plain
integer addresses such as AMDGPU fat pointers or CHERI capabilities.
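Difference 2) can be sketched with plain integers. The model below is an assumption-laden illustration: it squeezes the pointer representation into 64 bits, with the low index-width bits treated as the address and any remaining high bits as non-address data. `lowBits`, `ptrToIntModel` and `ptrToAddrModel` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Keep the low Bits bits of V (Bits in 0..64).
std::uint64_t lowBits(std::uint64_t V, unsigned Bits) {
  return Bits >= 64 ? V : (V & ((std::uint64_t{1} << Bits) - 1));
}

// Model of ptrtoint: extend or truncate the whole pointer representation
// to the result width.
std::uint64_t ptrToIntModel(std::uint64_t PtrBits, unsigned ResultWidth) {
  return lowBits(PtrBits, ResultWidth);
}

// Model of ptrtoaddr: first keep only the low index-width bits, then
// zero-extend or truncate that address to the result width.
std::uint64_t ptrToAddrModel(std::uint64_t PtrBits, unsigned IndexWidth,
                             unsigned ResultWidth) {
  std::uint64_t Addr = lowBits(PtrBits, IndexWidth);
  return lowBits(Addr, ResultWidth);
}
```

When the index width equals the representation width, the two models agree. With a 32-bit index width inside a 64-bit representation, only `ptrToIntModel` exposes the high bits.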
This commit introduces textual and bitcode IR support as well as basic code
generation, but optimization passes do not handle the new instruction yet
so it may result in worse code than using ptrtoint. Follow-up changes will
update capture tracking, etc. for the new instruction.
RFC: https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54
Reviewed By: nikic
Pull Request: https://github.com/llvm/llvm-project/pull/139357
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __memcmp routine is a millicode implementation;
we use millicode for the memcmp function instead of a library call to
improve performance.
The information whether a specific argument is vararg or fixed is
currently stored separately from all the other argument information in
ArgFlags. This means that it is not accessible from CCAssign, and
backends have developed all kinds of workarounds for how they can access
it after all.
Move this information to ArgFlags to make it directly available in all
relevant places.
I've opted to invert this and store it as IsVarArg, as I think that both
makes the meaning more obvious and provides for a better default (which
is IsVarArg=false).
Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Unspeakable horrors happen around calls from whole wave functions; the
plan is to improve the handling of caller/callee-saved registers in
a future patch.
Tail calls are also handled in a future patch.
This slightly relaxes the invariant established in #149310, by also
allowing the lifetime argument to be poison. This is to support the
typical pattern of RAUWing with poison when removing an instruction.
It's worth noting that this does not require any conservative
assumptions: lifetimes with poison arguments can simply be skipped.
Fixes https://github.com/llvm/llvm-project/issues/151119.
Split out from https://github.com/llvm/llvm-project/pull/150248:
Specify that the argument of lifetime.start/lifetime.end is ignored and
will be removed in the future.
Remove lifetime size handling from SDAG. The size was previously
discarded during isel, so was always ignored for stack coloring anyway.
Where necessary, obtain the size of the full frame index.
After https://github.com/llvm/llvm-project/pull/149310 we are guaranteed
that the argument is an alloca, so we don't need to look at underlying
objects (which was not a correct thing to do anyway).
This also drops the offset argument for lifetime nodes in SDAG. The
offset is fixed to zero now. (Peculiarly, while SDAG pretended to have
an offset, it just gets silently dropped during selection.)
When generating SDAG for a getelementptr with a vector result, we were
previously generating splats for each scalar operand. This essentially
has the effect of aggressively vectorizing the sequence, and leaving it
to later combines to scalarize if profitable.
Instead, we can keep the accumulating address as a scalar for as long as
the prefix of operands allows before lazily converting to vector on the
first vector operand. This both better fits hardware, which frequently
has a scalar base on the scatter/gather instructions, and reduces the
addressing cost even when it does not, as otherwise we end up with a
scalar to vector domain crossing for each scalar operand.
Note that constant splat offsets are treated as scalar for the above,
and only variable offsets can force a conversion to vector.
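The effect of the lazy conversion can be seen in a toy model that counts splats. This is a sketch under simplifying assumptions: offsets are already scaled to bytes, and `Offset`, `Lowered`, `eagerLowering` and `lazyLowering` are hypothetical names, not SelectionDAG code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical GEP operand: either one scalar offset or one offset per lane.
struct Offset {
  bool IsVector;
  std::int64_t Scalar;             // used when IsVector is false
  std::vector<std::int64_t> Lanes; // used when IsVector is true
};

struct Lowered {
  std::vector<std::int64_t> Addr; // final per-lane addresses
  unsigned Splats;                // scalar-to-vector conversions performed
};

// Previous scheme in this model: splat the base and every scalar operand.
Lowered eagerLowering(std::int64_t Base, const std::vector<Offset> &Offsets,
                      std::size_t NumLanes) {
  Lowered R;
  R.Addr.assign(NumLanes, Base);
  R.Splats = 1; // splat of the base
  for (const Offset &O : Offsets) {
    if (!O.IsVector)
      ++R.Splats;
    for (std::size_t L = 0; L < NumLanes; ++L)
      R.Addr[L] += O.IsVector ? O.Lanes[L] : O.Scalar;
  }
  return R;
}

// New scheme in this model: accumulate in a scalar until the first vector
// operand, and only then convert to a vector.
Lowered lazyLowering(std::int64_t Base, const std::vector<Offset> &Offsets,
                     std::size_t NumLanes) {
  Lowered R;
  R.Splats = 0;
  std::int64_t ScalarAddr = Base;
  bool IsVector = false;
  for (const Offset &O : Offsets) {
    if (!IsVector && !O.IsVector) {
      ScalarAddr += O.Scalar; // still in the scalar prefix
      continue;
    }
    if (!IsVector) {
      R.Addr.assign(NumLanes, ScalarAddr); // single lazy splat
      ++R.Splats;
      IsVector = true;
    }
    if (!O.IsVector)
      ++R.Splats; // scalar operand after the vector conversion
    for (std::size_t L = 0; L < NumLanes; ++L)
      R.Addr[L] += O.IsVector ? O.Lanes[L] : O.Scalar;
  }
  if (!IsVector) { // no vector operand at all: splat the final scalar
    R.Addr.assign(NumLanes, ScalarAddr);
    ++R.Splats;
  }
  return R;
}
```

For a base followed by two scalar offsets and one vector offset, both schemes compute the same addresses, but the eager model performs three splats and the lazy model one.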
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Also fix the LangRef to match the implementation. This was checking
against the alloca address space size rather than the default address
space.
The check was also more permissive than the LangRef. The error
check permitted any size less than the pointer size; follow the
stricter wording of the LangRef.
Seeing how we can't generate any debug intrinsics any more: delete a
variety of codepaths where they're handled. For the most part these are
plain deletions; in other places I've tweaked comments to remain
coherent, or added a type to (what were) type-generic lambdas.
This isn't all the DbgInfoIntrinsic call sites but it's most of the
simple scenarios.
Co-authored-by: Nikita Popov <github@npopov.com>