gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.
This PR removes all documentation occurrences of gfx940/gfx941 except
for the gfx940 ISA description, which will be the subject of a separate
PR.
For SWDEV-512631
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.
This PR removes all non-documentation occurrences of gfx940/gfx941 from
the llvm directory, and the remaining occurrences in clang.
Documentation changes will follow.
For SWDEV-512631
Attempting to pass a `ptr addrspace(7)` to functions that take `ptr`
arguments produces undesirable `addrspacecast(addrspacecast(p8 x to p7)
to p0) => addrspacecast(p8 x to p0)` folds. This results in illegal GEP
operations on buffer resources, which can't be GEP'd. (However, note
that, while unimplemneted, addressspacecast from ptr addrspace(7) to ptr
is legal - it's just an effective address computation)
To resolve this problem, and thus prevent illegal
`getelementptr T, ptr addrspace(8) %x, ...` s from being produces, this
commit extends amdgcn.make.buffer.rsrc to also be variadic in its result
type, auto-upgrading old manglings.
The logic for handling a make.buffer.rsrc in instruction selection
remains untouched and expects the output type to be a ptr addrspace(8),
as does the Clang lowering for its builtin (the pointer-to-pointer
version might want a different name in clang). LowerBufferFatPointers
has been updated to lower
amdgcn.make.buffer.rsrc.p7.p* to amdgcn.make.buffer.rsrc.p8.p* .
This'll also make exposing buffer fat pointers in Clang easier, since
you don't have to cast between a `__amdgcn_rsrc_t` and a pointer.
Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.
A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.
This was a bit annoying because these introduce a new special case
encoding usage. op_sel is repurposed as a subset of dpp controls,
and is eligible for VOP3->VOP1 shrinking. For some reason fi also
uses an enum value, so we need to convert the raw boolean to 1 instead
of -1.
The 2 registers are swapped, so this has 2 defs. Ideally the builtin
would return a pair, but that's difficult so return a vector instead.
This would make a hypothetical builtin that supports v2f16 directly
uglier.
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).
I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.
The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
Add some docs clarifying how inactive lanes are handled in the
amdgpu_cs_chain calling convention when the llvm.amdgcn.init.whole.wave
intrinsic is used.
This patch introduces a new generic target, `gfx9-4-generic`. Since it doesn’t support FP8 and XF32-related instructions, the patch includes several code reorganizations to accommodate these changes.
0 does not make sense as a value for this to be, much less the default.
Also stop emitting each individual field if it is the default, rather than
if any element was the default. Also fix the name of the test since it didn't
exactly match the real attribute name.
Both input and output of ballot are lane-masks:
result is lane-mask with 'S32/S64 LLT and SGPR bank'
input is lane-mask with 'S1 LLT and VCC reg bank'.
Ballot copies bits from input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in result for inactive lanes
for non-constant input.
global_wb with scopes lower than SCOPE_SYS is unnecessary for
correctness.
I was initially optimistic they would be very cheap no-ops but they can
actually be quite expensive so let's avoid them.
Adds hidden kernel arguments to the function signature and marks them
inreg if they should be preloaded into user SGPRs. The normal kernarg
preloading logic then takes over with some additional checks for the
correct implicitarg_ptr alignment.
Special care is needed so that metadata for the hidden arguments is not
added twice when generating the code object.
Converts AMDGPUResourceUsageAnalysis pass from Module to MachineFunction
pass. Moves function resource info propagation to to MC layer (through
helpers in AMDGPUMCResourceInfo) by generating MCExprs for every
function resource which the emitters have been prepped for.
Fixes https://github.com/llvm/llvm-project/issues/64863
Propose an extension to base type DIEs such that DW_ATE_address-encoded
base types can include an architecture specific address space. Use this
to implement DW_OP_convert conversions between AMDGPU address space
addresses where meaningful.
This patch adds a description of the core file format used for AMDGPU.
Reference implementation for creating and loading AMDGPU core dump is
available in
[ROCgdb-6.2](https://github.com/ROCm/ROCgdb/tree/rocm-6.2.x/gdb)
Add a spec for yet-to-be-implemented metadata to allow the backend to
fully handle atomicrmw lowering. This is the base of an alternative
to #69229, which inverts the direction to be correct by default, and
extends to cover the peer device case.
These are incremental changes over #89217 , with core logic being the
same. This patch along with #89217 and #91190 should get us ready to enable 64
bit optimizations in atomic optimizer.
This patch is intended to be the first of a series with end goal to
adapt atomic optimizer pass to support i64 and f64 operations (along
with removing all unnecessary bitcasts). This legalizes 64 bit readlane,
writelane and readfirstlane ops pre-ISel
---------
Co-authored-by: vikramRH <vikhegde@amd.com>
The various preloaded registers (workitem IDs, workgroup IDs, and
various implicit pointers) always have a finite, invariant, well-defined
value throughout a well-defined program.
In cases where the compiler infers or the user declares that some
implicit input will not be used (ex. via amdgcn-no-workitem-id-y), the
behavior of the entire program is undefined, since that misdeclaration
can cause arbitrary other preloaded-register intrinsics to access the
wrong register. This case is not expected to arise in practice, but
could occur when the no implicit argument attributes were not cleared
correctly in the presence of external functions, indrect calls, or other
means of executing un-analyzable code. Failure to detect that case would
be a bug in the attributor.
This commit updates the documentation to reflect this long-standing
reality.
Then, on the basis that all implicit arguments are defined in all
correct programs, the intrinsics that return those values are
annototated with `noundef``. Some implicit pointer arguments gain a
`nonnull`, but the kernel argument segment pointer or implicit argument
pointers don't necessarily have this property.
This will prevent spurious calls to `freeze` in front-end optimizations
that destroy user-provided ranges on built-in IDs.
(While I'm here, this commit adds a test for `noundef` on kernel
arguments which is currently unimplemented)
Now that we've got (minus some issues around datatypes and invariant
loads) working lowerings for address space 7, update the table in the
AMDGPU usage guide to properly indicate the nature of these address
spaces.
Revert "Revert 4 last AMDGPU commits to unbreak Windows bots"
This reverts commit 0d493ed2c6e664849a979b357a606dcd8273b03f.
MSVC does not like constexpr on the definition after an extern
declaration of a global.
Revert "AMDGPU: Try to fix build error with old gcc"
This reverts commit c7ad12d0d7606b0b9fb531b0b273bdc5f1490ddb.
Revert "AMDGPU: Use umin in set.rounding expansion"
This reverts commit a56f0b51dd988ad2b533de759c98457c1ed42456.
Revert "AMDGPU: Optimize set_rounding if input is known to fit in 2 bits (#88588)"
This reverts commit b4e751e2ab0ff152ed18dea59ebf9691e963e1dd.
Revert "AMDGPU: Implement llvm.set.rounding (#88587)"
This reverts commit 9731b77e80261c627d79980f8c275700bdaf6591.
Use a shift of a magic constant and some offseting to convert from
flt_rounds values.
I don't know why the enum defines Dynamic = 7. The standard suggests -1
is the cannot determine value. If we could start the extended values at
4 we wouldn't need the extra compare sub and select.
https://reviews.llvm.org/D153257