For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, SDISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.
This patch only selects flat instructions with immediate offsets from PTRADD
address computations with the inbounds flag: If the PTRADD does not leave the
bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.
Affected tests:
- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added
new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem
integral to this test, so the test is not changed to make offset folding still
happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds
addresses for accesses to be based on potentially OOB addresses used for
prefetching.
- I think the remaining ones suffer from the limited preservation of the
inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out
in PR #165424 and the fact that
`AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by
repeatedly splitting them in half. Legalizing a V32S32 memory accesses
therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd
inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a
single ptradd, but the involved transformations generally cannot preserve the
inbounds flag (even though it would be valid in this case).
Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001
Fixes SWDEV-516125.
Do not add latency for wavefront and singlethread scope fences during
barrier latency DAG mutation.
These scopes do not typically introduce any latency and adjusting
schedules based on them significantly impacts latency hiding.
On some targets, a packed f32 instruction can only read 32 bits from a
scalar operand (SGPR or literal) and replicates the bits to both
channels. In this case, we should not fold an immediate value if it
can't be replicated from its lower 32-bit.
Fixes SWDEV-567139.
In general, "Flat instructions look at the per-workitem address and
determine for each work item if the target memory address is in global,
private or scratch memory." (RDNA2 ISA) That means that FLAT
instructions need to be considered for VMEM hazards even without
"specific segment". Also, LDS DMA should be considered for LDS hazard
detection.
See also #137148
This PR adds a new combine to the `post-legalizer-combiner` pass. The
new combine checks for vectors being unmerged and subsequently padded
with `G_IMPLICIT_DEF` values by building a new vector. If such a case is
found, the vector being unmerged is instead just concatenated with a
`G_IMPLICIT_DEF` that is as wide as the vector being unmerged.
This removes unnecessary `mov` instructions in a few places.
This is in preparation for a patch that will only fold offsets into flat
instructions if their addition is inbounds. Marking the GEPs inbounds here
means that their output won't change with the later patch.
Basically a retry of the very similar PR #131994, as part of an updated stack
of PRs.
For SWDEV-516125.
This PR introduces `amdgpu-lower-exec-sync` pass which specifically
lowers named-barrier LDS globals introduced by #114550 .
Changes include:
- Moving the logic of lowering named-barrier LDS globals from
`amdgpu-lower-module-lds` pass to this new pass.
- This PR adds the pass to pipeline, remove the existing lowering logic for
named-barrier LDS in `amdgpu-lower-module-lds`
See #161827 for discussion on this topic.
This allows SDNodes to be validated against their expected type profiles
and reduces the number of changes required to add a new node.
Autogenerated node names start with "AMDGPUISD::", hence the changes in
the tests.
The few nodes defined in R600.td are *not* imported because TableGen
processes AMDGPU.td that doesn't include R600.td. Ideally, we would have
two sets of nodes, but that would require careful reorganization of td
files since some nodes are shared between AMDGPU/R600. Not sure if it
something worth looking into.
Some nodes fail validation, those are listed in
`AMDGPUSelectionDAGInfo::verifyTargetNode()`.
Part of #119709.
Pull Request: https://github.com/llvm/llvm-project/pull/168248
The main improvement is to the mfma tests. There are some
mild regressions scattered around, and a few major ones.
The worst regressions are in some of the bitcast tests;
these are cases where the SGPR argument list runs out
and uses VGPRs, and the copies-from-VGPR are misidentified
as divergent. Most of the shufflevector tests are also
regressions. These end up with cleaner MIR, but then get poor
regalloc decisions.
Some cases are relying on SIFixSGPRCopies to force VALU
reg_sequence inputs with SGPR inputs to use all VGPR inputs,
but this doesn't always happen if the reg_sequence isn't
invalid. Make sure we use a vgpr up-front here so we don't
rely on something later.
In true16 mode, D16 insts are lowered to a pseudo t16 first, and then
lowered to hi/lo inst in MC lowering using D16T16 table.
However, the D16T16 table selects both `flat_load_d16_t16 /
flat_load_d16_t16_saddr` to `flat_load_d16_(hi)_b16` which is wrong.
saddr pseudo inst `flat_load_d16_t16_saddr` should be selected to saddr
hi/lo inst
The global/scratch are correct while the flat seems to be the only one
with this issue.
InlineAsmLowering rejected inline assembly with memory reference inputs
if the values passed to the inline asm weren't pointers. The DAG
lowering however handled them just fine.
This patch updates InlineAsmLowering to store such values on the stack,
and then use the stack pointer as the "indirect" version of the operand.
AMDGPU: Really use AV classes by default for vector classes
Update getRegClassFor to use AV classes in place of VGPRs for
gfx90a-gfx950. There are a handful of regressions. Most are
enabling unprofitable rematerialization which reduce register
count by 1 but add an unnecessary instruction.
Refactor the XCnt optimization checks so that they can be checked when
applying a pre-existing waitcnt. This removes unnecessary xcnt waits
when taking a loop backedge.
AMDGPU: Start to use AV classes for unknown vector class
Use AGPR+VGPR superclasses for gfx90a+. The type used
for the class should be the broadest possible class, to
be contextually restricted later. InstrEmitter clamps these
to the common subclass of the context use instructions, so we're
best off using the broadest possible class for all types.
Note this does very little because we only use VGPR classes
for FP types (though this doesn't particularly make any sense),
and we legalize normal loads and stores to integer.
This PR implements parts of
https://github.com/llvm/llvm-project/issues/162376
- **Broader equivalence than constant index deltas**:
- Add Base-delta and Stride-delta matching for Add and GEP forms using
ScalarEvolution deltas.
- Reuse enabled for both constant and variable deltas when an available
IR value dominates the user.
- **Dominance-aware dictionary instead of linear scans**:
- Tuple-keyed candidate dictionary grouped by basic block.
- Walk the immediate-dominator chain to find the nearest dominating
basis quickly and deterministically.
- **Simple cost model and best-rewrite selection**:
- Score candidate expressions and rewrites; select the highest-profit
rewrite per instruction.
- Skip rewriting when expressions are already foldable or
high-efficiency.
- **Path compression for better ILP**:
- Compress chains of rewrites to a deeper dominating basis when a
constant delta exists along the path, reducing dependent bumps on
critical paths.
- **Dependency-aware rewrite ordering**:
- Build a dependency graph (basis, stride, variable delta producers) and
rewrite in topological order.
- This dependency graph will be needed by the next PR that adds partial
strength reduction.
This is in preparation for future changes in AMDGPU that will make more
substantial use of bundles pre-RA. For now, simply test this with
degenerate (single-instruction) bundles.
During init of unclustered schedule stage, minOccupancy may be
temporarily increased. But subsequently, if none of the regions are
scheduled because they don't meet the conditions of initGCNRegion,
minOccupancy remains incorrectly set. This patch avoids this
incorrectness by delaying the change of minOccupancy until a region is
about to be scheduled.
(#159884)
This eliminates the pseudo registerclasses used to hack the
wave register class, which are now replaced with RegClassByHwMode,
so most of the diff is from register class ID renumbering.
Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.
This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.
The coalescer does not just widen registers out of nowhere. If it's
trying
to "widen" a register, it's generally packing a register into an
existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a11015 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ead4b didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be of garden variety subreg handling issue. We could use more dynamic
state information here if it really is an issue.
I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.
Reasoning behind proposed change. This helps us move away from selecting
v_alignbits for fshr with uniform operands.
V_ALIGNBIT is defined in the ISA as:
D0.u32 = 32'U(({ S0.u32, S1.u32 } >> S2.u32[4 : 0]) & 0xffffffffLL)
Note: S0 carries the MSBs and S1 carries the LSBs of the value being
aligned.
I interpret that as : concat (s0, s1) >> S2, and use the 0X1F mask to
return the lower 32 bits.
fshr:
fshr i32 %src0, i32 %src1, i32 %src2
Where:
concat(%src0, %src1) represents the 64-bit value formed by %src0 as the
high 32 bits and %src1 as the low 32 bits.
%src2 is the shift amount.
Only the lower 32 bits are returned.
So these two are identical.
So, I can expand the V_ALIGNBIT through bit manipulation as:
Concat: S1 | (S0 << 32)
Shift: ((S1 | (S0 << 32)) >> S2)
Break the shift: (S1>>S2) | (S0 << (32 – S2)
The proposed pattern does exactly this.
Additionally, src2 in the fshr pattern should be:
* must be 0–31.
* If the shift is ≥32, hardware semantics differ; you must handle it
with extra instructions.
The extra S_ANDs limit the selection only to the last 5 bits