If we can't fold a PTRADD's offset into its users, lowering them to
disjoint ORs is preferable: Often, a 32-bit OR instruction suffices
where we'd otherwise use a pair of 32-bit additions with carry.
This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to just turn PTRADDs into disjoint ORs whenever that's
sound because this transform loses the information that the operation
implements pointer arithmetic, which AMDGPU for instance needs when
folding constant offsets.
For SWDEV-516125.
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp
that check for ISD::ADD in a pointer context, but as far as I can tell
those are only relevant for 32-bit pointer arithmetic (like frame
indices/scratch addresses and LDS), for which we don't enable PTRADD
generation yet.
For SWDEV-516125.
Previously this would only accept full copy hints. This relaxes
this to accept some subregister copies. Specifically, this now
accepts:
- Copies to/from physical registers if there is a compatible
super register
- Subreg-to-subreg copies
This has the potential to repeatedly add the same hint to the
hint vector, but not sure if that's a real problem.
This patch mirrors similar patterns for ISD::ADD. The main difference is
that ISD::ADD is commutative, so that a pattern definition for, e.g.,
(add (mul x, y), z), automatically also handles (add z, (mul x, y)).
ISD::PTRADD is not commutative, so we would need to handle these cases
explicitly. This patch only implements (ptradd z, (op x, y)) patterns,
where the nested operation (shift or multiply) is the offset of the
ptradd (i.e., the right operand), since base pointers that are the
result of a shift or multiply seem less likely.
For SWDEV-516125.
Add new event SCC_WRITE for s_barrier_signal_isfirst and s_barrier_leave,
instructions that write to SCC, counter is KM_CNT.
Also start tracking SCC for reads and writes.
s_barrier_wait on the same barrier guarantees that the SCC write from
s_barrier_signal_isfirst has landed, no need to insert s_wait_kmcnt.
It seems the VMEM access on hi/lo half could interfere the other half.
Track waitcnt of vgpr32 instead of vgpr16 for 16bit reg in GFX11.
---------
Co-authored-by: Joe Nash <joseph.nash@amd.com>
Fixes https://github.com/iree-org/iree/issues/22001
The visitor in SplitPtrStructs would re-visit instructions if an
instruction earlier in program order caused a recursive visit() call via
getPtrParts(). This would cause instructions to be processed multiple
times.
As a consequence of this, PHI nodes could be added to the Conditionals
array multiple times, which would to a conditinoal that was already
simplified being processed multiple times. After the code moved to
InstSimplifyFolder, this re-processing, combined with more agressive
simplifications, would lead to an attempt to replace an instruction with
itself, causing an assertion failure and crash.
This commit resolves the issue and adds the reduced form of the crashing
input as a test.
Hardware initializes a single value in ttmp9 which is either the
workgroup ID X or cluster ID X. Most of this patch is a refactoring to
use a single `PreloadedValue` enumerator for this value, instead of two
enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same
value.
This makes it simpler to have a single attribute
`amdgpu-no-workgroup-id-x` indicating that this value is not used, which
in turns sets the TGID_EN_X bit appropriately to tell the hardware
whether to initialize it.
All of the above applies to Y and Z similarly.
Fixes: LWPSCGFX13-568
Co-authored-by: Jay Foad <jay.foad@amd.com>
After replacing VGPR MFMAs with the AGPR form, we've alleviated VGPR
pressure which may have triggered spills during allocation. Identify
these spill slots, and try to reassign them to newly freed VGPRs,
and replace the spill instructions with copies.
Fixes#154260
The S_NOP instruction has an immediate operand which is one less than
the number of cycles to delay for. The maximum value that may be encoded
in this field was increased in GFX8 and again in GFX12.
There is a cache on the known-bit computed by global-isel. It only works
inside a single query to computeKnownBits, which limits its usefulness,
and according to the tests can sometimes limit the effectiveness of
known-bits queries. (Although some AMD tests look longer). Keeping the
cache valid and clearing it at the correct times can also require being
careful about the functions called inside known-bits queries.
I measured compile-time of removing it and came up with:
```
7zip 2.06405E+11 2.06436E+11 0.015018992
Bullet 1.01298E+11 1.01186E+11 -0.110236169
ClamAV 57942466667 57848066667 -0.16292023
SPASS 45444466667 45402966667 -0.091320249
consumer 35432466667 35381233333 -0.144594317
kimwitu++ 40858833333 40927933333 0.169118877
lencod 70022366667 69950633333 -0.102443457
mafft 38439900000 38413233333 -0.069372362
sqlite3 35822266667 35770033333 -0.145812474
tramp3d 82083133333 82045600000 -0.045726
Average -0.068828739
```
The last column is % difference between with / without the cache. So in
total it seems to be costing slightly more to keep the current
known-bits cache than if it was removed. (Measured in instruction count,
similar to llvm-compile-time-tracker).
The hit rate wasn't terrible - higher than I expected. In the
llvm-test-suite+external projects it was hit 4791030 times out of
91107008 queries, slightly more than 5%.
Note that as globalisel increases in complexity, more known bits calls
might be made and the numbers might shift. If that is the case it might
be better to have a cache that works across calls, providing it doesn't
make effectiveness worse.
Since many code are connected, this also changes how workgroup id is lowered.
Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
ptr argument.
However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.
AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.
Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.
Passed PSDB.
Previous attempt to address the issue in IPSCCP, likely stalled:
https://github.com/llvm/llvm-project/pull/154522
This solution may be preferrable over that as issue only affects AMDGPU.
I have no idea why this was here. MIMG atomics use tied operands
for the input and output, so AV classes should have always worked.
We have poor test coverage for AGPRs with atomics, so add a partial
set. Everything seems to work OK, although it seems image cmpswap
always uses VGPRs unnecessarily.
Find a common subclass instead of directly checking for a subclass
relationship. This fixes folding logic for unaligned register defs
into aligned use contexts. e.g., a vreg_64 def into an av_64_align2
use should be able to find the common subclass vreg_align2. This
avoids regressions in future patches.
Checking the subclass was also redundant on the subregister path;
getMatchingSuperRegClass is sufficient.
Restrict to VGPR only (VRegSrc_32) for mfma scale operands to workaround
a hardware design defect: For all Inline/SGPR constants, SP HW use bits
[30:23] as the scale.
TODO: We may still be able to allow Inline Constants/SGPR, with a proper
shift, to obtain a potentially better performance.
Fixes: SWDEV-548629
- Enable s_or_b64/s_and_b64/s_xor_b64 for v2i32. Add various additional
combines to make use of these newly legalised instructions.
- Update several tests and separate legacy r600 tests where necessary.
Add G_AMDGPU_LOAD_D16 generic instructions and GINodeEquivs for them,
this will import D16 load patterns to global-isel's tablegened
instruction selector.
For newly imported patterns to work add combines for G_AMDGPU_LOAD_D16
in AMDGPURegBankCombiner.
Cover all the missing cases and add very detailed tests for each rule.
In summary:
- Flat and Scratch, addrspace(0) and addrspace(5), loads are always
divergent.
- Global and Constant, addrspace(1) and addrspace(4), have real uniform
loads, s_load, but require additional checks for align and flags in mmo.
For not natural align or not uniform mmo do uniform-in-vgpr lowering.
- Private, addrspace(3), only has instructions for divergent load, for
uniform do uniform-in-vgpr lowering.
- Store rules are simplified using Ptr32 and Ptr64.
All operands need to be vgpr.
Some tests have code size regression since they use more sgpr instructions,
marked with FixMe comment to get back to later.