629 Commits

Author SHA1 Message Date
pvanhout
868abf0961 Revert "[AMDGPU] Remove Code Object V3 (#67118)"
This reverts commit 544d91280c26fd5f7acd70eac4d667863562f4cc.
2023-10-18 12:55:36 +02:00
Jay Foad
bff17f9f23
[AMDGPU] Remove support for no-return buffer atomic intrinsics. NFC. (#69326)
Thsi removes some of the machinery added by D85268, which was unused
since D87719 changed all buffer atomic intrinsics to return a value.
2023-10-17 14:47:32 +01:00
Pierre van Houtryve
544d91280c
[AMDGPU] Remove Code Object V3 (#67118)
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.

- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
2023-10-16 08:21:48 +02:00
Jay Foad
b49a0dbaeb
[AMDGPU] Fix comments about afn and arcp in fast unsafe fdiv handling (#68982) 2023-10-13 19:23:53 +01:00
Thomas Symalla
aa5158cd1e
[AMDGPU] Use absolute relocations when compiling for AMDPAL and Mesa3D (#67791)
The primary ISA-independent justification for using PC-relative
addressing is that it makes code position-independent and therefore
allows sharing of .text pages between processes.

When not sharing .text pages, we can use absolute relocations instead,
which will possibly prevent a bubble introduced by s_getpc_b64.

Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>
2023-10-10 09:22:02 +02:00
Mirko Brkušanin
ecfdc23dd2
[AMDGPU] Select gfx1150 SALU Float instructions (#66885) 2023-09-21 12:22:55 +02:00
Pierre van Houtryve
e9e3868707
[AMDGPU] Correctly restore FP mode in FDIV32 lowering (#66346)
Addresses the FIXME for both DAGISel and GISel.
2023-09-15 08:11:01 +02:00
Pierre van Houtryve
3d0353793b
[AMDGPU] Fix HasFP32Denormals check in FDIV32 lowering (#66212)
Fixes SWDEV-403219
2023-09-14 08:47:10 +02:00
Matt Arsenault
c48248d7f9 AMDGPU: Teach valueIsKnownNeverF32Denorm about frexp
https://reviews.llvm.org/D158130
2023-09-12 23:23:10 +03:00
Matt Arsenault
72a7024add AMDGPU: Correctly lower llvm.sqrt.f32
Make codegen emit correctly rounded sqrt by default.

Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.

https://reviews.llvm.org/D158129
2023-09-12 23:22:54 +03:00
Stanislav Mekhanoshin
070c2570ad
[AMDGPU] Global ISel for packed fp32 instructions (#65803) 2023-09-11 10:48:37 -07:00
Matt Arsenault
77c67436d9 LLT: Add some stub constructors for FP types
This is to start documenting uses to ease a future migration
to supporting different types with the same size.

https://reviews.llvm.org/D150605
2023-09-03 08:33:19 -04:00
Martin
432eda5cc5 Remove dead assignments
Dst is never used after creating the type making
these assignments dead

Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D158610
2023-08-24 12:30:27 +01:00
Matt Arsenault
c8eeee2be2 AMDGPU: Drop unsafe 1/sqrt -> rsq combine
AMDGPUCodeGenPrepare implements a safer version of this that handles
denormals correctly.

https://reviews.llvm.org/D158032
2023-08-16 08:52:17 -04:00
Matt Arsenault
81b278e613 AMDGPU: Fix fast f32 exp2
Mirror of the previous log changes, OpenCL conformance doesn't like
interpreting afn as ignore denormal handling but was previously hidden
by flag dropping.
2023-08-15 10:48:46 -04:00
Matt Arsenault
4b7b4b9458 AMDGPU: Fix fast f32 log/log10
OpenCL conformance didn't like interpreting afn as ignore the denormal
handling.

https://reviews.llvm.org/D157940
2023-08-15 10:48:46 -04:00
Matt Arsenault
e09b3593ba AMDGPU: Fix fast math log2 f32
Apparently afn doesn't allow you to drop the denormal handling
according to OpenCL conformance. This was hidden by losing the flags
during the library linking process. Fast log is still broken and needs
more work.

https://reviews.llvm.org/D157936
2023-08-15 10:48:46 -04:00
Matt Arsenault
1faa4797ca AMDGPU: Handle unsafe exp.f32 with denormal handling
I somehow missed this path when adding the new expansions. Saves a lot
of instructions for afn + IEEE.

https://reviews.llvm.org/D157867
2023-08-14 18:36:01 -04:00
Joe Nash
2fb4bfa5ba [AMDGPU][True16] Fix ISel for A16 Image Instructions
The 16-bit VAddr arguments to A16 image instructions are packed into
legal VGPR_32 operands in AMDGPULegalizerInfo::legalizeImageIntrinsic on
all subtargets. With True16, we also need to pack if the number of VAddr is one
because VGPR_16 is not a legal argument to those Image instructions.

No change to emitted code intended on subtargets pre-GFX11, and none on GFX11
until True16 is active.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D157426
2023-08-11 11:12:16 -04:00
Matt Arsenault
1030483561 AMDGPU/GlobalISel: Handle stacksave/stackrestore
https://reviews.llvm.org/D156670
2023-08-11 10:25:01 -04:00
Mirko Brkusanin
fadf3e7f2b [AMDGPU][GlobalISel] Update legalizer for G_ABS, G_SMIN, G_SMAX, G_UMIN, G_UMAX
There is no need to increase the size of odd sized vectors if they are
going to be scalarized by a different rule.

Patch by: Acim Maravic

Differential Revision: https://reviews.llvm.org/D155865
2023-08-02 12:18:18 +02:00
Sameer Sahasrabuddhe
d9847cde48 [GlobalISel] convergent intrinsics
Introduced the convergent equivalent of the existing G_INTRINSIC opcodes:

- G_INTRINSIC_CONVERGENT
- G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS

Out of the targets that currently have some support for GlobalISel, the patch
assumes that the convergent intrinsics only relevant to SPIRV and AMDGPU.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154766
2023-07-31 12:15:39 +05:30
Sameer Sahasrabuddhe
7c760b224b Restore "[GlobalISel] GIntrinsic subclass to represent intrinsics in Generic Machine IR"
Some opcodes in generic MIR represent calls to intrinsics, where the intrinsic
ID is the first non-def operand to the instruction. These are now represented as
a subclass of GenericMachineInstr, and the method MachineInstr::getIntrinsicID()
is now moved to this subclass GIntrinsic.

Some target-defined instructions behave like GMIR intrinsics, and have an
Intrinsic::ID operand. But they should not be recognized as generic intrinsics,
and should not use GIntrinsic::getIntrinsicID(). Separated these out by
introducing a new AMDGPU::getIntrinsicID().

Reviewed By: arsenm, Pierre-vh

Differential Revision: https://reviews.llvm.org/D155556

This restores commit baa3386edb11a2f9bcadda8cf58d56f3707c39fa.
Originally reverted in d0f7850b01cf17e50a4f4b00e3b84dded94df6b8.
2023-07-27 14:49:17 +05:30
Sameer Sahasrabuddhe
d0f7850b01 Revert "[GlobalISel] GIntrinsic subclass to represent intrinsics in Generic Machine IR"
This reverts commit baa3386edb11a2f9bcadda8cf58d56f3707c39fa.

The changes did not cover all occurrences of the deteleted function
MachineInstr::getIntrinsicID().
2023-07-27 10:14:24 +05:30
Sameer Sahasrabuddhe
baa3386edb [GlobalISel] GIntrinsic subclass to represent intrinsics in Generic Machine IR
Some opcodes in generic MIR represent calls to intrinsics, where the intrinsic
ID is the first non-def operand to the instruction. These are now represented as
a subclass of GenericMachineInstr, and the method MachineInstr::getIntrinsicID()
is now moved to this subclass GIntrinsic.

Some target-defined instructions behave like GMIR intrinsics, and have an
Intrinsic::ID operand. But they should not be recognized as generic intrinsics,
and should not use GIntrinsic::getIntrinsicID(). Separated these out by
introducing a new AMDGPU::getIntrinsicID().

Reviewed By: arsenm, Pierre-vh

Differential Revision: https://reviews.llvm.org/D155556
2023-07-27 10:00:45 +05:30
Matt Arsenault
e3fd8f83a8 AMDGPU: Correctly expand f64 sqrt intrinsic
rocm-device-libs and llpc were avoiding using f64 sqrt
intrinsics in favor of their own expansions. Port the
expansion into the backend. Both of these users should be
updated to call the intrinsic instead.

The library and llpc expansions are slightly different.
llpc uses an ldexp to do the scale; the library uses a multiply.

Use ldexp to do the scale instead of the multiply.
I believe v_ldexp_f64 and v_mul_f64 are always the same number of
cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.

The libraries have another fast version of sqrt which will
be handled separately.

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.
2023-07-25 07:54:11 -04:00
Matt Arsenault
0295513238 AMDGPU: Filter out contract flags when lowering exp
It is unsafe to contract the fsub into the fmul. It also increases
code size by duplicating a constant.
2023-07-20 18:14:24 -04:00
Matt Arsenault
bd2dca0f74 AMDGPU: Use hex floats instead of ugly bitcasting 2023-07-17 19:54:04 -04:00
Matt Arsenault
fbe4ff8149 AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
2023-07-11 15:14:52 -04:00
Matt Arsenault
5491666248 AMDGPU: Correctly lower llvm.exp.f32
The library expansion has too many paths for all the permutations of
DAZ, unsafe and the 3 exp functions. It's easier to expand it in the
backend when we know all of these things. The library currently misses
the no-infinity check on the overflow, which this handles optimizing
out.

Some of the <3 x half> fast tests regress due to vector widening
dropping flags which will be fixed separately.

Apparently there is no exp10 intrinsic, but there should be. Adds some
deadish code in preparation for adding one while I'm following along
with the current library expansion.
2023-07-05 17:23:49 -04:00
Matt Arsenault
ed556a1ad5 AMDGPU: Correctly lower llvm.exp2.f32
Previously this did a fast math expansion only.
2023-07-05 17:23:48 -04:00
Matt Arsenault
9c82dc6a6b AMDGPU: Always use v_rcp_f16 and v_rsq_f16
These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.
2023-07-05 16:53:01 -04:00
Matt Arsenault
4e15f378ee AMDGPU: Correctly lower llvm.log.f32 and llvm.log10.f32
Previously we expanded these in a fast-math way and the device
libraries were relying on this behavior. The libraries have a pending
change to switch to the new target intrinsic.

Unlike the library version, this takes advantage of no-infinities on
the result overflow check.
2023-07-05 15:30:35 -04:00
Matt Arsenault
003b58f65b IR: Add llvm.frexp intrinsic
Add an intrinsic which returns the two pieces as multiple return
values. Alternatively could introduce a pair of intrinsics to
separately return the fractional and exponent parts.

AMDGPU has native instructions to return the two halves, but could use
some generic legalization and optimization handling. For example, we
should be able to handle legalization of f16 on older targets, and for
bf16. Additionally antique targets need a hardware workaround which
would be better handled in the backend rather than in library code
where it is now.
2023-06-28 14:50:16 -04:00
Matt Arsenault
89ccfa1b39 AMDGPU: Use correct lowering for llvm.log2.f32
We previously directly codegened to v_log_f32, which is broken for
denormals. The lowering isn't complicated, you simply need to scale
denormal inputs and adjust the result. Note log and log10 are still
not accurate enough, and will be fixed separately.
2023-06-23 08:37:37 -04:00
Matt Arsenault
92ee60b66f AMDGPU: Drop and upgrade llvm.amdgcn.atomic.inc/dec to atomicrmw 2023-06-21 21:20:26 -04:00
Matt Arsenault
d0923a7739 AMDGPU: Correct constants used in fast math log expansion
The division between float constants was done with less
precision. Performing the divide in double and truncating to float
provides the same value as used in the library fast math expansion.
2023-06-12 21:11:41 -04:00
Matt Arsenault
4e4c351ae5 AMDGPU: Avoid endpgm in middle of block for fallback trap lowering.
This was inserting an s_endpgm in the middle of the block when it has
to be a terminator. Split the block and insert a branch to a new block
with the trap if it's not in a terminator position.

Fixes verifier error on LDS in function with no trap support (and
other trap sources).
2023-06-09 21:04:38 -04:00
Matt Arsenault
eece6ba283 IR: Add llvm.ldexp and llvm.experimental.constrained.ldexp intrinsics
AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.

Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.
2023-06-06 17:07:18 -04:00
Krzysztof Drewniak
23098bd454 [AMDGPU] Add intrinsic for converting global pointers to resources
Define the function @llvm.amdgcn.make.buffer.rsrc, which take a 64-bit
pointer, the 16-bit stride/swizzling constant that replace the high 16
bits of an address in a buffer resource, the 32-bit extent/number of
elements, and the 32-bit flags (the latter two being the 3rd and 4th
wards of the resource), and combines them into a ptr addrspace(8).

This intrinsic is lowered during the early phases of the backend.

This intrinsic is needed so that alias analysis can correctly infer
that a certain buffer resource points to the same memory as some
global pointer. Previous methods of constructing buffer resources,
which relied on ptrtoint, would not allow for such an inference.

Depends on D148184

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D148957
2023-06-05 17:07:59 +00:00
Krzysztof Drewniak
ab37937812 [AMDGPU] Use resource base for buffer instruction MachineMemOperands
1. Remove the existing code that would encode the constant offsets (if
there were any) on buffer intrinsic operations onto their
`MachineMemOperand`s. As far as I can tell, this use of `offset` has
no substantial impact on the generated code, especially since the same
reasoning is performed by areMemAccessesTriviallyDisjoint().

2. When a buffer resource intrinsic takes a pointer argument as the
base resource/descriptor, place that memory argument in the value
field of the MachineMemOperand attached to that intrinsic.

This is more conservative than what would be produced by more typical
LLVM code using GEP, as the Value (for alias analysis purposes)
corresponding to accessing buffer[0] and buffer[1] is the same.
However, the target-specific analysis of disjoint offsets covers a lot
of the simple usecases.

Despite this limitation, the new buffer intrinsics, combined with
LLVM's existing pointer annotations, allow for non-trivial
optimizations, as seen in the new tests, where marking two buffer
descriptors "noalias" allows merging together loads and stores in a
"load from A, modify loaded value, store to B" sequence, which would
not be possible previously.

Depends on D147547

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D148184
2023-06-05 17:06:57 +00:00
Krzysztof Drewniak
faa2c678aa [AMDGPU] Add buffer intrinsics that take resources as pointers
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.

The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.

One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.

In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.

This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.

Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.

Depends on D145441

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D147547
2023-06-05 16:59:07 +00:00
Matt Arsenault
2f5a116cf7 AMDGPU: Expand casted f16 fmed3 pattern to fmin/fmax on gfx8
If we have legal f16 instructions but no f16 med3, we can save
one instruction by expanding out the min/max sequence compared
to casting to f32 and casting back.
2023-05-23 08:48:25 +01:00
Mateja Marjanovic
cf76074a36 [AMDGPU][GlobalISel] Check exact width in get*ClassForBitWidth and widen if necessary
Instead of checking if the given bitwidth is less or equal to a bitwidth of an existing RegClass,
check if it has the exact same value.

For LLVM vector types that don't have a corresponding Register Class, widen them during legalization.
That goes for G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_BUILD_VECTOR.

Differential revision: https://reviews.llvm.org/D148096
Reviewers: foad, arsenm
2023-05-03 17:32:24 +02:00
Mateja Marjanovic
6175ec0bb6 Revert "[AMDGPU][GlobalISel] Widen the vector operand in G_BUILD/INSERT/EXTRACT_VECTOR"
This reverts commit b25c7cafcbe1b52ea2d1ff5e5c2f13674b5f297d.
2023-05-03 17:28:01 +02:00
Mateja Marjanovic
b25c7cafcb [AMDGPU][GlobalISel] Widen the vector operand in G_BUILD/INSERT/EXTRACT_VECTOR
Widen the vector operand type in G_BUILD_VECTOR, G_INSERT_VECTOR_ELT,
G_EXTRACT_VECTOR_ELT to the nearest larger RegClass.
2023-05-03 17:14:38 +02:00
Changpeng Fang
3bc1e084ee AMDGPU: Created a subclass for the return address operand in the tail call return instruction
Summary:
  This is to avoid using the callee saved registers for the return address
of the tail call return instruction.

Reviewers:
  arsenm, cdevadas

Differential Revision:
  https://reviews.llvm.org/D147096
2023-04-10 10:53:33 -07:00
Jon Chesterfield
0507448d82 [amdgpu] Implement dynamic LDS accesses from non-kernel functions
The premise here is to allow non-kernel functions to locate external LDS variables without using LDS or extra magic SGPRs to do so.

1/ First it crawls the callgraph to work out which external LDS variables are reachable from a given kernel
2/ Then it creates a new `extern char[0]` variable for each kernel, which will alias all the other extern LDS variables because that's the documented behaviour of these variables
3/ The address of that variable is written to a lookup table. The global variable is tagged with metadata to track what address it was allocated at by codegen
4/ The assembler builds the lookup table using the metadata
5/ Any non-kernel functions use the same magic intrinsic used by table lookups of non-dynamic LDS variables to find the address to use

Heavy overlap with the code paths taken for other lowering, in particular the same intrinsic is used to pass the dynamic scope information through the same sgpr as for table lookups of static LDS.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D144233
2023-04-04 20:06:34 +01:00
Mariusz Sikora
ea064ee2a3 [AMDGPU] Create Subtarget Features for some of 16 bits atomic fadd instructions
Introducing Subtarget Features for instructions:
- ds_pk_add_bf16
- ds_pk_add_f16
- ds_pk_add_rtn_bf16
- ds_pk_add_rtn_f16
- flat_atomic_pk_add_f16
- flat_atomic_pk_add_bf16
- global_atomic_pk_add_f16
- global_atomic_pk_add_bf16
- buffer_atomic_pk_add_f16

Differential Revision: https://reviews.llvm.org/D146701
2023-03-24 13:10:40 +01:00
Jay Foad
dcb834843e [AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.

Differential Revision: https://reviews.llvm.org/D144650
2023-02-23 16:38:15 +00:00