Unlike LLVM IR `Instruction::eraseFromParent()`,
`MachineInstr::eraseFromParent()` is void and does not return the
iterator following the erased instruction. Returning an iterator can be
very helpful for example when we are erasing MachineInstrs while
iterating, as it provides a convenient way to get a valid iterator.
This patch updates `MachineInstr::eraseFromParent()` to return a
`MachineBlock::iterator` (which is a
`MachineInstrBundleIterator<MachineInstr>`). If the erased instruction
is the head of a bundle, then the returned iterator points to the next
bundle (see unittest).
- Add `MONonVolatile` MachineMemOperand flag.
- Set nv=1 on memory operations on GFX12.5 if the operation accesses a
constant address space,
is an invariant load, or has the `MONonVolatile` flag set.
The xcnt wait is actually required before any memory access that can
only be done once, so atomic stores and volatile accesses are affected.
This patch also ensures buffer instructions are handled.
Also breaks the long inheritance chains by making both
`SIGfx10CacheControl` and
`SIGfx12CacheControl` inherit from `SICacheControl` directly.
With this patch, we now just have 3 `SICacheControl` implementations
that each
do their own thing, and there is no more code hidden 3 superclasses
above (which made this code harder to read and maintain than it needed
to be).
Merge the following classes into `SIGfx6CacheControl`:
- SIGfx7CacheControl
- SIGfx90ACacheControl
- SIGfx940CacheControl
They were all very similar and had a lot of duplicated boilerplate just
to implement one or two codegen differences. GFX90A/GFX940 have a bit
more differences, but they're still manageable under one class because
the general behavior is the same.
This removes 500 lines of code and puts everything into a single place
which I think makes it a lot easier to maintain, at the cost of a slight
increase in complexity for some functions.
There is still a lot of room for improvement but I think this patch is
already big enough as is and I don't want to bundle too much into one
review.
They were previously optimized to not emit any waitcnt, which is
technically correct because there is no reordering of operations at
workgroup scope in CU mode for GFX10+.
This breaks transitivity however, for example if we have the following
sequence of events in one thread:
- some stores
- store atomic release syncscope("workgroup")
- barrier
then another thread follows with
- barrier
- load atomic acquire
- store atomic release syncscope("agent")
It does not work because, while the other thread sees the stores, it
cannot release them at the wider scope. Our release fences aren't strong
enough to "wait" on stores from other waves.
We also cannot strengthen our release fences any further to allow for
releasing other wave's stores because only GFX12 can do that with
`global_wb`. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with
both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization
has to be paid either at the acq, or the rel. Strengthening the releases
would just make the memory model more complex but wouldn't help
performance.
So the choice here is to streamline the code sequences by making CU and
WGP mode emit almost identical (vL0 inv is not needed in CU mode) code
for release (or stronger) atomic ordering.
This also removes the `vm_vsrc(0)` wait before barriers. Now that the
release fence in CU mode is strong enough, it is no longer needed.
Supersedes #160501
Solves SC1-6454
The primary purpose of this commit is to enable marking loads to LDS
(global.load.lds, buffer.*.load.lds) volatile (using bit 31 of the aux
as with normal buffer loads) and to ensure that their !nontemporal
annotations translate to appropriate settings of te cache control bits.
However, in the process of implementing this feature, we also fixed
- Incorrect handling of buffer loads to LDS in GlobalISel
- Updating the handling of volatile on buffers in SIMemoryLegalizer:
previously, the mapping of address spaces would cause volatile on buffer
loads to be silently dropped on at least gfx10.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The check used was not strong enough to prevent the insertion of sample/bvhcnt when they were not needed.
I assume SIInsertWaitCnts was trimming those away anyway, but this was a bug nonetheless.
We were inserting SAMPLE/BVHcnt waits in places where we only needed to wait on the previous atomic operation. Neither of these counter have any atomics associated with them.
Mostly NFC, and adds an assertion for gfx12 to ensure that no atomic scratch
instructions are present in the case of GloballyAddressableScratch. This should
always hold because of #154710.
A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete, even if it is mmra-limited to
LDS. So far, that would be skipped for non-global fence releases.
Fixes SWDEV-554932.
Implements the base of the MemoryLegalizer for a roughly correct GFX1250 memory model.
Documentation will come later, and some remaining changes still have to be added, but this is the backbone of the model.
- Add clang built-ins + sema/codegen
- Add IR Intrinsic + verifier
- Add DAG/GlobalISel codegen for the intrinsics
- Add lowering in SIMemoryLegalizer using a MMO flag.
Re-enable it by adding a wait on vm_vsrc before every barrier "start"
instruction in GFX10/11/12 CU mode.
This is a less strong wait than what we do without BackOffBarrier, thus
this shouldn't introduce
any new guarantees that can be abused, instead it relaxes the guarantees
we have now to the bare
minimum needed to support the behavior users want (fence release +
barrier works).
There is an exact memory model in the works which will be documented
separately.
The new instruction represents the unknown number of waitcnts needed at a
release operation to ensure that prior direct loads to LDS (formerly called LDS
DMA) are completed. The instruction is replaced in SIInsertWaitcnts with a
suitable value for vmcnt().
Co-authored-by: Austin Kerbow <austin.kerbow@amd.com>.
We can do it all in finalizeStore if we ensure it always sees the
stores.
For that, I needed to fix a hidden bug where finalizeStore wouldn't see
all stores
because sometimes the iterator got out-of-sync and didn't point to the
store anymore.
This also removes the waits before volatile LDS stores which never
needed it, that was a bug until now.
"amdgpu-as" is way too vague and doesn't give enough context.
We may want to support it on normal atomics too, to control the synchronized (ordered) AS.
If we do that, the name has to be less vague.
This annotates the `Twine` passed to the constructors of the various
DiagnosticInfo subclasses with `[[clang::lifetimebound]]`, which causes
us to warn when we would try to print the twine after it had already
been destructed.
We also update `DiagnosticInfoUnsupported` to hold a `const Twine &`
like all of the other DiagnosticInfo classes, since this warning allows
us to clean up all of the places where it was being used incorrectly.
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.
This PR removes all non-documentation occurrences of gfx940/gfx941 from
the llvm directory, and the remaining occurrences in clang.
Documentation changes will follow.
For SWDEV-512631
global_wb with scopes lower than SCOPE_SYS is unnecessary for
correctness.
I was initially optimistic they would be very cheap no-ops but they can
actually be quite expensive so let's avoid them.
Using MMRAs, allow `builtin_amdgcn_fence` to emit fences that only
target one or more address spaces, instead of fencing all address spaces
at once.
This is done through a `amdgpu-as` MMRA. Currently focused on OpenCL
fences, but can very easily support more AS names and codegen on more
than just fences.
Insert waitcnts for loads and atomics before stores with system scope.
Scope is field in instruction encoding and corresponds to desired
coherence level in cache hierarchy.
Intrinsic stores can set scope in cache policy operand.
If volatile keyword is used on generic stores memory legalizer will set
scope to system. Generic stores, by default, get lowest scope level.
Waitcnts are not required if it is guaranteed that memory is cached.
For example vulkan shaders can guarantee this.
TODO: implement flag for frontends to give us a hint not to insert
waits.
Expecting vulkan flag to be implemented as vulkan:private MMRA.
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.