V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.
- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
The primary ISA-independent justification for using PC-relative
addressing is that it makes code position-independent and therefore
allows sharing of .text pages between processes.
When not sharing .text pages, we can use absolute relocations instead,
which will possibly prevent a bubble introduced by s_getpc_b64.
Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>
Make codegen emit correctly rounded sqrt by default.
Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.
https://reviews.llvm.org/D158129
Mirror of the previous log changes, OpenCL conformance doesn't like
interpreting afn as ignore denormal handling but was previously hidden
by flag dropping.
Apparently afn doesn't allow you to drop the denormal handling
according to OpenCL conformance. This was hidden by losing the flags
during the library linking process. Fast log is still broken and needs
more work.
https://reviews.llvm.org/D157936
The 16-bit VAddr arguments to A16 image instructions are packed into
legal VGPR_32 operands in AMDGPULegalizerInfo::legalizeImageIntrinsic on
all subtargets. With True16, we also need to pack if the number of VAddr is one
because VGPR_16 is not a legal argument to those Image instructions.
No change to emitted code intended on subtargets pre-GFX11, and none on GFX11
until True16 is active.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D157426
There is no need to increase the size of odd sized vectors if they are
going to be scalarized by a different rule.
Patch by: Acim Maravic
Differential Revision: https://reviews.llvm.org/D155865
Introduced the convergent equivalent of the existing G_INTRINSIC opcodes:
- G_INTRINSIC_CONVERGENT
- G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS
Out of the targets that currently have some support for GlobalISel, the patch
assumes that the convergent intrinsics only relevant to SPIRV and AMDGPU.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D154766
Some opcodes in generic MIR represent calls to intrinsics, where the intrinsic
ID is the first non-def operand to the instruction. These are now represented as
a subclass of GenericMachineInstr, and the method MachineInstr::getIntrinsicID()
is now moved to this subclass GIntrinsic.
Some target-defined instructions behave like GMIR intrinsics, and have an
Intrinsic::ID operand. But they should not be recognized as generic intrinsics,
and should not use GIntrinsic::getIntrinsicID(). Separated these out by
introducing a new AMDGPU::getIntrinsicID().
Reviewed By: arsenm, Pierre-vh
Differential Revision: https://reviews.llvm.org/D155556
This restores commit baa3386edb11a2f9bcadda8cf58d56f3707c39fa.
Originally reverted in d0f7850b01cf17e50a4f4b00e3b84dded94df6b8.
This reverts commit baa3386edb11a2f9bcadda8cf58d56f3707c39fa.
The changes did not cover all occurrences of the deteleted function
MachineInstr::getIntrinsicID().
Some opcodes in generic MIR represent calls to intrinsics, where the intrinsic
ID is the first non-def operand to the instruction. These are now represented as
a subclass of GenericMachineInstr, and the method MachineInstr::getIntrinsicID()
is now moved to this subclass GIntrinsic.
Some target-defined instructions behave like GMIR intrinsics, and have an
Intrinsic::ID operand. But they should not be recognized as generic intrinsics,
and should not use GIntrinsic::getIntrinsicID(). Separated these out by
introducing a new AMDGPU::getIntrinsicID().
Reviewed By: arsenm, Pierre-vh
Differential Revision: https://reviews.llvm.org/D155556
rocm-device-libs and llpc were avoiding using f64 sqrt
intrinsics in favor of their own expansions. Port the
expansion into the backend. Both of these users should be
updated to call the intrinsic instead.
The library and llpc expansions are slightly different.
llpc uses an ldexp to do the scale; the library uses a multiply.
Use ldexp to do the scale instead of the multiply.
I believe v_ldexp_f64 and v_mul_f64 are always the same number of
cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.
The libraries have another fast version of sqrt which will
be handled separately.
I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
The library expansion has too many paths for all the permutations of
DAZ, unsafe and the 3 exp functions. It's easier to expand it in the
backend when we know all of these things. The library currently misses
the no-infinity check on the overflow, which this handles optimizing
out.
Some of the <3 x half> fast tests regress due to vector widening
dropping flags which will be fixed separately.
Apparently there is no exp10 intrinsic, but there should be. Adds some
deadish code in preparation for adding one while I'm following along
with the current library expansion.
These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.
Previously we expanded these in a fast-math way and the device
libraries were relying on this behavior. The libraries have a pending
change to switch to the new target intrinsic.
Unlike the library version, this takes advantage of no-infinities on
the result overflow check.
Add an intrinsic which returns the two pieces as multiple return
values. Alternatively could introduce a pair of intrinsics to
separately return the fractional and exponent parts.
AMDGPU has native instructions to return the two halves, but could use
some generic legalization and optimization handling. For example, we
should be able to handle legalization of f16 on older targets, and for
bf16. Additionally antique targets need a hardware workaround which
would be better handled in the backend rather than in library code
where it is now.
We previously directly codegened to v_log_f32, which is broken for
denormals. The lowering isn't complicated, you simply need to scale
denormal inputs and adjust the result. Note log and log10 are still
not accurate enough, and will be fixed separately.
The division between float constants was done with less
precision. Performing the divide in double and truncating to float
provides the same value as used in the library fast math expansion.
This was inserting an s_endpgm in the middle of the block when it has
to be a terminator. Split the block and insert a branch to a new block
with the trap if it's not in a terminator position.
Fixes verifier error on LDS in function with no trap support (and
other trap sources).
AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.
Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.
Define the function @llvm.amdgcn.make.buffer.rsrc, which take a 64-bit
pointer, the 16-bit stride/swizzling constant that replace the high 16
bits of an address in a buffer resource, the 32-bit extent/number of
elements, and the 32-bit flags (the latter two being the 3rd and 4th
wards of the resource), and combines them into a ptr addrspace(8).
This intrinsic is lowered during the early phases of the backend.
This intrinsic is needed so that alias analysis can correctly infer
that a certain buffer resource points to the same memory as some
global pointer. Previous methods of constructing buffer resources,
which relied on ptrtoint, would not allow for such an inference.
Depends on D148184
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D148957
1. Remove the existing code that would encode the constant offsets (if
there were any) on buffer intrinsic operations onto their
`MachineMemOperand`s. As far as I can tell, this use of `offset` has
no substantial impact on the generated code, especially since the same
reasoning is performed by areMemAccessesTriviallyDisjoint().
2. When a buffer resource intrinsic takes a pointer argument as the
base resource/descriptor, place that memory argument in the value
field of the MachineMemOperand attached to that intrinsic.
This is more conservative than what would be produced by more typical
LLVM code using GEP, as the Value (for alias analysis purposes)
corresponding to accessing buffer[0] and buffer[1] is the same.
However, the target-specific analysis of disjoint offsets covers a lot
of the simple usecases.
Despite this limitation, the new buffer intrinsics, combined with
LLVM's existing pointer annotations, allow for non-trivial
optimizations, as seen in the new tests, where marking two buffer
descriptors "noalias" allows merging together loads and stores in a
"load from A, modify loaded value, store to B" sequence, which would
not be possible previously.
Depends on D147547
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D148184
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.
The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.
One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.
In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.
This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.
Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.
Depends on D145441
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D147547
If we have legal f16 instructions but no f16 med3, we can save
one instruction by expanding out the min/max sequence compared
to casting to f32 and casting back.
Instead of checking if the given bitwidth is less or equal to a bitwidth of an existing RegClass,
check if it has the exact same value.
For LLVM vector types that don't have a corresponding Register Class, widen them during legalization.
That goes for G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_BUILD_VECTOR.
Differential revision: https://reviews.llvm.org/D148096
Reviewers: foad, arsenm
Summary:
This is to avoid using the callee saved registers for the return address
of the tail call return instruction.
Reviewers:
arsenm, cdevadas
Differential Revision:
https://reviews.llvm.org/D147096
The premise here is to allow non-kernel functions to locate external LDS variables without using LDS or extra magic SGPRs to do so.
1/ First it crawls the callgraph to work out which external LDS variables are reachable from a given kernel
2/ Then it creates a new `extern char[0]` variable for each kernel, which will alias all the other extern LDS variables because that's the documented behaviour of these variables
3/ The address of that variable is written to a lookup table. The global variable is tagged with metadata to track what address it was allocated at by codegen
4/ The assembler builds the lookup table using the metadata
5/ Any non-kernel functions use the same magic intrinsic used by table lookups of non-dynamic LDS variables to find the address to use
Heavy overlap with the code paths taken for other lowering, in particular the same intrinsic is used to pass the dynamic scope information through the same sgpr as for table lookups of static LDS.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144233
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.
Differential Revision: https://reviews.llvm.org/D144650