The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.
This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.
This is part of a stack split out from #173259:
- #180467
- #180466
Fixes: SWDEV-521121
This patch changes the memset lowering to match the optimized memcpy lowering.
The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred
memory access type. If that type is larger than a byte, the memset is lowered
into two loops: a main loop that stores a sufficiently wide vector splat of the
SetValue with the preferred memory access type and a residual loop that covers
the remaining bytes individually. If the memset size is statically known, the
residual loop is replaced by a sequence of stores.
This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by
around 7-20x.
I'm planning similar treatment for memset.pattern as a follow-up PR.
For SWDEV-543208.
This changes LangRef to specify that pointer icmp only compares the
address bits of the pointers. That is, `icmp pred %a, %b` is equivalent
to `icmp pred ptrtoaddr(%a), ptrtoaddr(%b)`.
Similarly, it specifies that the `nonnull` attribute requires that the
address bits are non-zero.
There are a couple of motivations for this:
* For inequality comparisons, this is really the only sensible
semantics. Relational comparison of address and metadata bits as a
single integer is generally meaningless (unless the metadata bits are
equal).
* This matches (as far as I understand) the behavior of existing CHERI
implementations.
* LLVM can only reason about the address bits. These semantics allow
pointers with non-address bits to receive essentially the same
comparison optimization support as ordinary pointers.
In terms of implementation, this PR adjusts:
* The AMDGPULowerBufferFatPointers pass.
* An InstCombine fold that may replace pointers with different
non-address bits.
* The fold that replaces pointers based on dominating pointer equality.
It does not adjust:
* ISel, because we don't have in-tree targets where we can show a
difference.
* Various icmp+ptrtoint transforms, because we'll have to change this
code for ptrtoaddr optimization support anyway, and these changes are
tightly related.
Related discussion starting from:
https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/60?u=nikic
Reverts llvm/llvm-project#154069. I pointed out a number of issues
post-merge, most importantly examples of miscompiles:
https://github.com/llvm/llvm-project/pull/154069#issuecomment-3603854626.
While the motivation of the change is clear, I think the implementation
approach is flawed. It seems like the goal is to allow elements like
`load <2xi16>` and `load i32` to be vectorized together despite the
current algorithm not grouping them into the same equivalence classes. I
personally think that if we want to attempt this it should be a more
wholistic approach, maybe even redefining the concept of an equivalence
class. This current solution seems like it would be really hard to do
bug-free, and even if the bugs were not present, it is only able to
merge chains that happen to be adjacent to each other after
`splitChainByContiguity`, which seems like it is leaving things up to
chance whether this optimization kicks in. But we can discuss more in
the re-land. Maybe the broader approach I'm proposing is too difficult,
and a narrow optimization is worthwhile. Regardless, this should be
reverted, it needs more iteration before it is correct.
This change enables the LoadStoreVectorizer to merge and vectorize
contiguous chains even when their scalar element types differ, as long
as the total bitwidth matches. To do so, we rebase offsets between
chains, normalize value types to a common integer type, and insert the
necessary casts around loads and stores. This uncovers more
vectorization opportunities and explains the expected codegen updates
across AMDGPU tests.
Key changes:
- Chain merging
- Build contiguous subchains and then merge adjacent ones when:
- They refer to the same underlying pointer object and address space.
- They are either all loads or all stores.
- A constant leader-to-leader delta exists.
- Rebasing one chain into the other's coordinate space does not overlap.
- All elements have equal total bit width.
- Rebase the second chain by the computed delta and append it to the
first.
- Type normalization and casting
- Normalize merged chains to a common integer type sized to the total
bits.
- For loads: create a new load of the normalized type, copy metadata,
and cast back to the original type for uses if needed.
- For stores: bitcast the value to the normalized type and store that.
- Insert zext/trunc for integer size changes; use bit-or-pointer casts
when sizes match.
- Cleanups
- Erase replaced instructions and DCE pointer operands when safe.
- New helpers: computeLeaderDelta, chainsOverlapAfterRebase,
rebaseChain, normalizeChainToType, and allElemsMatchTotalBits.
Impact:
- Increases vectorization opportunities across mixed-typed but
size-compatible access chains.
- Large set of expected AMDGPU codegen diffs due to more/changed
vectorization.
This PR resolves#97715.
If the input to LowerBufferFatPointers is such that the resource- and
offset-specific `select` instructions generated for a `select` on `ptr
addrspae(7)` fold away, the pass would crash when trying to replace an
instruction with itself. This commit resolves the issue.
Fixes https://github.com/iree-org/iree/issues/22551
Fix a crash that would arise when intrinsics like llvm.masked.load.T.p7
were left in the module when AMDGPULowerBufferFatPointers was applied
and so a captures(none) annotation would be applied to a non-pointer
value, triggering a verifier failure.
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
Fixes https://github.com/iree-org/iree/issues/22001
The visitor in SplitPtrStructs would re-visit instructions if an
instruction earlier in program order caused a recursive visit() call via
getPtrParts(). This would cause instructions to be processed multiple
times.
As a consequence of this, PHI nodes could be added to the Conditionals
array multiple times, which would to a conditinoal that was already
simplified being processed multiple times. After the code moved to
InstSimplifyFolder, this re-processing, combined with more agressive
simplifications, would lead to an attempt to replace an instruction with
itself, causing an assertion failure and crash.
This commit resolves the issue and adds the reduced form of the crashing
input as a test.
Fixes#154056.
The fat buffer lowering pass was erroniously detecting that it did not
need to run on functions that only load/store to the null constant (or
other such constants). We thought this would be covered by specializing
constants out to instructions, but that doesn't account foc trivial
constants like null. Therefore, we check the operands of instructions
for buffer fat pointers in order to find such constants and ensure the
pass runs.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
Not quite NFC as it looks like the original intrinsic-handling code
never got updated to use records. This was never caught because that
code wasn't tested. I've adjusted an existing test so the behaviour is
now covered.
It appears this wasn't handled in the initial migration a year ago,
seemingly because it didn't lead to any test failures. Find and interpret
debug records in the same way the original code handled intrinsics. Note
that we drop a call to copyMetadata: debug records can't carry additional
metadata like instructions, nothing relies on this in AMDGPU AFAIUI.
This flag was used to let us incrementally introduce debug records
into LLVM, however everything is now using records. It serves no
purpose now, so delete it.
This fixes#139614 on non-clang compilers by moving `__has_warning`
completely inside the `#if defined(__clang__)` block. This prevents a
parse failure from compilers which don't recognize `__has_warning`.
Original description:
Followup to #138741.
This adds the requested macro to silence
`-Wunnecessary-virtual-specifier` when declaring virtual anchor
functions in `final` classes, per [LLVM
policy](https://llvm.org/docs/CodingStandards.html#provide-a-virtual-method-anchor-for-classes-in-headers).
It also cleans up any remaining instances of the warning, allowing us to
stop disabling it when we build LLVM.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Some application code operating on generic pointers (that then gete
initialized to buffer fat pointers) may perform tests against nullptr.
After address space inference, this results in comparisons against
`addrspacecast (ptr null to ptr addrspace(7))`, which were crashing.
However, while general casts to ptr addrspace(7) from generic pointers
aren't supposted, it is possible to cast null pointers to the all-zerose
bufer resource and 0 offset, which this patch adds.
It also adds a TODO for casting _out_ of buffer resources, which isn't
implemented here but could be.
This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.
This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).
It also plumbs the intrinsic through to clang.
This patch adds support for LLVM IR atomicrmw `fmaximum` and `fminimum`
instructions.
These mirror the `llvm.maximum.*` and `llvm.minimum.*` instructions, but
are atomic and use IEEE754 2019 handling for NaNs, which is different to
`fmax` and `fmin`. See:
https://llvm.org/docs/LangRef.html#llvm-minimum-intrinsic
for more details.
Future changes will allow this LLVM IR to be lowered to specialised
assembler instructions on suitable targets, such as AArch64.
Now that IRMover and the rest of LLVM don't allow recursive types, drop
support for them from the clone of the IRMover code used when lowering
buffer fat pointer operations.
This patch adds support for LLVM IR atomicrmw `fmaximum` and `fminimum`
instructions.
These mirror the `llvm.maximum.*` and `llvm.minimum.*` instructions, but
are atomic and use IEEE754 2019 handling for NaNs, which is different to
`fmax` and `fmin`. See:
https://llvm.org/docs/LangRef.html#llvm-minimum-intrinsic
for more details.
Future changes will allow this LLVM IR to be lowered to specialised
assembler instructions on suitable targets, such as AArch64.
This reverts commit d1a05721172272f7aab685b56d99e86814a15bff.
There was further discussion on the PR about whether the intinsics
should exist in this form.
Add a buffer_fat_ptr_load_lds intrinsic, by analogy with
global_load_lds, which enables using `ptr addrspace(7)` to set the rsrc
and offset arguments to raw_ptr_buffer_load_lds.
This PR updates AMDGPULowerBufferFatPointers to use the
InstSimplifyFolder
when creating IR during buffer fat pointer lowering.
This shouldn't cause any large functional changes and might improve the
quality of the generated code.
Since LowerBufferFatPointers runs before PreISelIntrinsicLowering, which
normally handles unsupported memcpy()s,, and since you can't have a
`noalias {ptr addrspace(8), i32}` becasue it crashes later passes,
manually expand memcpy()s involving buffer fat pointers to loops.
Additionally, though they're unlikely to be used, this commit adds
support for memset().
This commit doesn't implement writing direct-to-LDS loads as the
intrinsics, but leaves the option in the future.
Attempting to pass a `ptr addrspace(7)` to functions that take `ptr`
arguments produces undesirable `addrspacecast(addrspacecast(p8 x to p7)
to p0) => addrspacecast(p8 x to p0)` folds. This results in illegal GEP
operations on buffer resources, which can't be GEP'd. (However, note
that, while unimplemneted, addressspacecast from ptr addrspace(7) to ptr
is legal - it's just an effective address computation)
To resolve this problem, and thus prevent illegal
`getelementptr T, ptr addrspace(8) %x, ...` s from being produces, this
commit extends amdgcn.make.buffer.rsrc to also be variadic in its result
type, auto-upgrading old manglings.
The logic for handling a make.buffer.rsrc in instruction selection
remains untouched and expects the output type to be a ptr addrspace(8),
as does the Clang lowering for its builtin (the pointer-to-pointer
version might want a different name in clang). LowerBufferFatPointers
has been updated to lower
amdgcn.make.buffer.rsrc.p7.p* to amdgcn.make.buffer.rsrc.p8.p* .
This'll also make exposing buffer fat pointers in Clang easier, since
you don't have to cast between a `__amdgcn_rsrc_t` and a pointer.
The lowering for GEP didn't properly support the case where the pointer
argument was being implicitly broadcast by a vector of indices. Fix
that.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The current lowering for ptr addrspace(7) assumed that the instruction
selector can handle arbtrary LLVM types, which is not the case. Code
generation can't deal with
- Values that aren't 8, 16, 32, 64, 96, or 128 bits long
- Aggregates (this commit only handles arrays of scalars, more may come)
- Vectors of more than one byte
- 3-word values that aren't a vector of 3 32-bit values (for axample, a
<6 x half>)
This commit adds a buffer contents type legalizer that adds the needed
bitcasts, zero-extensions, and splits into subcompnents needed to
convert a load or store operation into one that can be successfully
lowered through code generation.
In the long run, some of the involved bitcasts (though potentially not
the buffer operation splitting) ought to be handled by the instruction
legalizer, but SelectionDAG makes this difficult.
It also takes advantage of the new `nuw` flag on `getelementptr` when
lowering GEPs to offset additions.
We don't currently plumb through `nsw` on GEPs since that should likely
be a separate change and would require declaring what we mean by "the
address" in the context of the GEP guarantees.
This commit usis the `nuw` flag on `getelemnetptr` to set the `nuw` flag
on buffer offset additions, and also moves from `inbounds` to the looser
`nusw` for the existing case.
This patch works around:
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp:1101:13:
error: enumeration values 'USubCond' and 'USubSat' not handled in
switch [-Werror,-Wswitch]
I've notified the author in #105568.
Upstream the intrinsics `llvm.amdgcn.raw.atomic.buffer.load`
and `llvm.amdgcn.raw.atomic.ptr.buffer.load`.
These additional intrinsics mark atomic buffer loads
as atomic to LLVM by removing the `IntrReadMem`
attribute. Otherwise, it could hoist these
intrinsics out of loops in cases where LLVM marks
them as invariant. That can cause issues such as
infinite loops.
Continuation of https://reviews.llvm.org/D138786
with the additional use in the fat buffer lowering,
more test cases and the additional ptr versions
of these intrinsics.
---------
Co-authored-by: rtayl <>
Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Mariusz Sikora <mariusz.sikora@amd.com>
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.