…ions
When a merge-like instruction has all readanylane sources and the result
is copied to VGPRs, eliminate the readanylanes by either using the
original unmerge source directly or building a new merge with the VGPR
sources.
With recent refactoring, LDS promotion worklists for all allocas are
populated upfront. In some cases, this puts the same User in multiple
lists. Then as each list is processed, a User might get deleted via
removeFromParent, potentially leaving a dangling pointer in a subsequent
worklist.
Currently this only occurs for memcpy and memmove. Prior to refactoring,
these were handled by DeferredInstr, and were processed after the last
use of the then singular worklist.
This change moves processing of DeferredInstr to after all worklists
have been processed.
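
The ordering can be illustrated with a toy model. This is only a sketch, not the pass's code: `User`, `process_worklists` and the `deferred` flag are hypothetical stand-ins, and "erasing" is modeled by flipping a flag.

```python
class User:
    """Hypothetical stand-in for an IR user; `alive` turns False once erased."""

    def __init__(self, name, deferred=False):
        self.name = name
        self.deferred = deferred  # models memcpy/memmove, which are deferred
        self.alive = True


def process_worklists(worklists):
    """Visit every worklist first, then erase deferred users exactly once."""
    deferred = []  # ordered and de-duplicated
    visited = []
    for worklist in worklists:
        for user in worklist:
            # Erasing inside this loop would trip the assert for any user
            # that also sits in a later worklist.
            assert user.alive, f"dangling reference to {user.name}"
            visited.append(user.name)
            if user.deferred and user not in deferred:
                deferred.append(user)
    # Only now is it safe to erase: no worklist will be read again.
    for user in deferred:
        user.alive = False
    return visited, [u.name for u in deferred]


memcpy = User("memcpy", deferred=True)
load = User("load")
store = User("store")
visited, erased = process_worklists([[load, memcpy], [memcpy, store]])
```

Here `memcpy` appears in both worklists, is visited twice while still alive, and is erased once at the end.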
The pass now contains a non-fp expansion and should
be used for any similar expansions regardless of the
types involved. Hence a generic name seems apt.
Rename the source files, pass, and adjust the pass
description. Move all tests for the expansions
that have previously been merged into the pass
to a single directory.
Both passes expand instructions at the IR level.
They use the same kind of instruction visitation
logic and contain significant code duplication e.g.
for scalarization.
On GFX12+, GLOBAL_INV increments the loadcnt counter but does not write
results to any VGPRs. Previously, we unconditionally inserted
s_wait_loadcnt 0 at function returns even when the only pending loadcnt
was from GLOBAL_INV instructions.
This patch optimizes waitcnt insertion by skipping the loadcnt wait at
function boundaries when no VGPRs have pending loads. This is determined
by checking if any VGPR has a score greater than the lower bound for
LOAD_CNT - if not, the pending loadcnt must be from non-VGPR-writing
instructions like GLOBAL_INV.
The optimization is limited to GFX12+ targets where GLOBAL_INV exists
and uses the extended wait count instructions.
This is a follow-up optimization to PR #135340 which added tracking for
GLOBAL_INV in the waitcnt pass.
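
A minimal sketch of the check described above, assuming a simplified score model. The function name, the `vgpr_scores` dictionary and the bound parameters are illustrative and are not taken from the waitcnt pass.

```python
def needs_loadcnt_wait_at_return(vgpr_scores, lower_bound, upper_bound):
    """Decide whether a loadcnt wait is needed at a function return.

    vgpr_scores: VGPR name -> score of the last load that writes it
    lower_bound: scores at or below this value have already completed
    upper_bound: score of the most recently issued loadcnt event
    """
    if upper_bound == lower_bound:
        return False  # nothing is pending at all
    # Something is pending. Wait only if a VGPR is still being written;
    # otherwise the pending events are non-VGPR-writing (e.g. GLOBAL_INV).
    return any(score > lower_bound for score in vgpr_scores.values())


only_global_inv = needs_loadcnt_wait_at_return({}, lower_bound=0, upper_bound=1)
load_then_inv = needs_loadcnt_wait_at_return({"v0": 1}, lower_bound=0, upper_bound=2)
completed_load = needs_loadcnt_wait_at_return({"v0": 1}, lower_bound=1, upper_bound=2)
```

In this model only `load_then_inv` requires the wait, because `v0` still has a score above the lower bound.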
If the sign bit of the denominator is known 0, do not emit the fabs.
Also, extend this to handle min/max with fabs inputs.
I originally tried to do this as the general combine on fabs, but
it proved to be too much trouble at this time. This is mostly
complexity introduced by expanding the various min/maxes into
canonicalizes, and then not being able to assume the sign bit
of canonicalize (fabs x) without nnan.
This defends against future code size regressions in the atan2 and
atan2pi library functions.
The optimization crashed while attempting to fix a fold of a COPY $exec
instruction into a use in an INLINEASM instruction. It calls
isOperandLegal, which crashes because the operand index is outside the
bounds of the MCInstrDesc's operands array.
Change SIOptimizeExecMaskingPreRA to skip the optimization if the
operand index is out of bounds.
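
The guard amounts to a bounds check before the legality query. The sketch below uses hypothetical names (`try_fold`, `desc_operands`, `is_operand_legal`) and assumes the instruction can carry more operands than its descriptor declares, as in the INLINEASM case above.

```python
def try_fold(desc_operands, op_index, is_operand_legal):
    """Return True only if the fold may proceed.

    desc_operands: operands declared by the instruction descriptor
    op_index: index of the use operand on the actual instruction
    is_operand_legal: callback that indexes into desc_operands
    """
    if not 0 <= op_index < len(desc_operands):
        return False  # skip the optimization instead of indexing out of bounds
    return is_operand_legal(desc_operands[op_index])


declared = ["asm_string", "extra_info"]  # hypothetical descriptor operands
skipped = try_fold(declared, 5, lambda operand: True)
allowed = try_fold(declared, 1, lambda operand: True)
```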
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
On gfx12+, the unified `s_barrier` is lowered to split
`s_barrier_signal/s_barrier_wait` pairs. By default, the dependency edge
between signal and wait has zero latency, causing the scheduler to emit
them adjacent to each other. This misses the opportunity to hide barrier
latency.
This patch adds synthetic latency to the signal-wait barrier edge to
encourage latency hiding. Independent instructions are scheduled in the
gap between split barrier signal and wait.
The latency is tunable via -amdgpu-barrier-signal-wait-latency.
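
The effect of the edge latency can be seen in a toy single-issue list scheduler. This is not the AMDGPU machine scheduler; `list_schedule`, the instruction names and the latency value 3 are made up for illustration.

```python
def list_schedule(instrs, edges):
    """Greedy single-issue scheduler.

    instrs: instruction names in program order
    edges: {(pred, succ): latency} dependency edges (assumed acyclic)
    """
    issue_cycle = {}
    order = []
    cycle = 0
    while len(order) < len(instrs):
        for name in instrs:
            if name in issue_cycle:
                continue
            preds = [(p, lat) for (p, s), lat in edges.items() if s == name]
            ready = all(
                p in issue_cycle and cycle >= issue_cycle[p] + lat
                for p, lat in preds
            )
            if ready:
                issue_cycle[name] = cycle
                order.append(name)
                break  # at most one instruction per cycle
        cycle += 1
    return order


program = ["barrier_signal", "barrier_wait", "a", "b"]
zero_latency = list_schedule(program, {("barrier_signal", "barrier_wait"): 0})
with_latency = list_schedule(program, {("barrier_signal", "barrier_wait"): 3})
```

With zero latency the wait is issued right after the signal. With a non-zero latency the independent instructions `a` and `b` are issued in the gap.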
Fixes: SWDEV-567090
The PHI-node part was merged in PR #160909.
Extend `isOpLegal` to treat 8/16-bit vector add/sub/and/or/xor as
profitable on SDWA targets (stores and intrinsics remain profitable).
This repacks loop-carried values to i32 across BBs and restores SDWA
lowering instead of scattered lshr/lshl/or sequences.
Testing:
- Local: `check-llvm-codegen-amdgpu` is green (4314/4320 passed, 6
XFAIL).
- Additional: validated in AMD internal CI
RegBankLegalize using the trivial mapping helper, which assigns the same
register bank (vgpr or sgpr) to all operands.
Uncovers multiple codegen and regbank combiner regressions related to
looking through sgpr to vgpr copies.
Skip regbankselect-concat-vector.mir since agprs are not yet supported.
A buildbot failed for the original patch.
https://github.com/llvm/llvm-project/pull/171835 addresses the issue
raised by the buildbot.
After the fix is merged, the original patch is reapplied without any
change.
PHIs that are larger than a legal integer type are split into multiple
virtual registers that are numbered sequentially. We can propagate the
known bits for each of these registers individually.
Big endian is not supported yet because the register order needs to be
reversed.
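
As a rough illustration of the per-register propagation, the sketch below splits the known-zero and known-one masks of a wide value into parts. It assumes little-endian order, so part `i` covers bits `[i * part_bits, (i + 1) * part_bits)`. The function name and the mask representation are hypothetical.

```python
def split_known_bits(known_zero, known_one, total_bits, part_bits):
    """Split known-bits masks of a wide value into per-register masks.

    Returns a list of (known_zero, known_one) pairs, lowest part first,
    matching sequentially numbered registers on a little-endian target.
    """
    mask = (1 << part_bits) - 1
    parts = []
    for i in range(total_bits // part_bits):
        shift = i * part_bits
        parts.append(((known_zero >> shift) & mask, (known_one >> shift) & mask))
    return parts


# A 64-bit PHI whose upper half is known zero and whose bit 0 is known one.
parts = split_known_bits(0xFFFFFFFF00000000, 0x1, total_bits=64, part_bits=32)
```

On a big-endian target the list would have to be reversed, which is the unsupported case mentioned above.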
Fixes #171671
This patch adds register bank legalization support for buffer load byte
and short operations in the AMDGPU GlobalISel pipeline.
This is a re-land of #167798. I have fixed the failing test
/CodeGen/AMDGPU/GlobalISel/buffer-load-byte-short.ll
A buildbot failure in https://github.com/llvm/llvm-project/pull/170323
when expensive checks were used highlighted that some of these patterns
were missing.
This patch adds `V_INDIRECT_REG_{READ/WRITE}_GPR_IDX` and
`V/S_INDIRECT_REG_WRITE_MOVREL` for `V6` and `V7` vector sizes.
Fixed a crash in Blender due to some weird control flow.
The issue was with the "merge" function which was only looking at the
keys of the "Other" VMem/SGPR maps. It needs to look at the keys of both
maps and merge them.
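
A simplified sketch of the fix, assuming each side's scores must be rebased by a per-side shift when two states are merged. The shift model and all names here are assumptions made for illustration, not the pass's actual data structures.

```python
def merge_scores(mine, other, my_shift, other_shift):
    """Merge two score maps over the union of their keys."""

    def rebase(score, shift):
        return 0 if score is None else score + shift

    merged = {}
    # Iterating over other.keys() alone would leave entries that exist
    # only in `mine` without their rebased score.
    for key in sorted(mine.keys() | other.keys()):
        merged[key] = max(
            rebase(mine.get(key), my_shift),
            rebase(other.get(key), other_shift),
        )
    return merged


merged = merge_scores({"v0": 2}, {"v1": 1}, my_shift=3, other_shift=0)
```

`v0` exists only in the first map and still ends up with its rebased score.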
Original commit message below
----
The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.
There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.
This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).
I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
This change is motivated by the overall goal of finding alternative ways
to promote allocas to VGPRs. The current solution is effectively limited
to allocas whose size matches a register class, and we can't keep adding
more register classes. We have some downstream work in this direction,
and I'm currently looking at cleaning that up to bring it upstream.
This refactor paves the way to adding a third way of promoting allocas,
on top of the existing alloca-to-vector and alloca-to-LDS. Much of the
analysis can be shared between the different promotion techniques.
Additionally, the idea behind splitting the pass into an analysis
phase and a commit phase is that it ought to allow us to more easily
make better "big picture" decisions about which allocas to promote how
in the future.
Two S_WAITCNT_DEPCTR instructions are constructed with hardcoded operand
values. Replace these with appropriate calls to
AMDGPU::DepCtr::encodeFieldVmVsrc().
NFC, except that the original code was setting reserved operand bits
that should be zero, and this is now corrected.
Before this patch, `insertelement/extractelement` with dynamic indices
would fail to select with `-O0` for vector 32-bit element types with
sizes 3, 5, 6 and 7, which did not map to a `SI_INDIRECT_SRC/DST`
pattern. Other "weird" sizes bigger than 8 (like 13) are properly
handled already.
To solve this issue we add the missing patterns for the problematic
sizes.
Solves SWDEV-568862
This is a followup to https://github.com/llvm/llvm-project/pull/171114,
removing the handling for most libcalls that are already canonicalized
to intrinsics in the middle-end. The only remaining one is fabs, which
has more test coverage than the others.
If the subtarget supports flat scratch SVS mode and there is no SGPR
available to replace a frame index, convert a scratch instruction in SS
form into SV form and replace the frame index with a scavenged VGPR.
Resolves #155902
Co-authored-by: Matt Arsenault <matthew.arsenault@amd.com>
Some floating-point optimizations don't trigger because they can produce
incorrect results around signed zeros, and rely on the existence of the
nsz flag which commonly appears when fast-math is enabled.
However, this flag is not a hard requirement when all of the users of
the combined value are either guaranteed to overwrite the sign-bit or
simply ignore it (comparisons, etc.).
The optimizations affected:
- fadd x, +0.0 -> x
- fsub x, -0.0 -> x
- fsub +0.0, x -> fneg x
- fdiv(x, sqrt(x)) -> sqrt(x)
- frem lowering with power-of-2 divisors
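
The signed-zero hazard behind the first three folds can be reproduced with ordinary IEEE-754 doubles. The snippet below is only an illustration in Python, with `sign_bit` as a small helper introduced here.

```python
import math


def sign_bit(x):
    """True if the sign bit of x is set (distinguishes -0.0 from +0.0)."""
    return math.copysign(1.0, x) < 0


x = -0.0
zero = 0.0

fadd_unfolded = x + zero     # fadd x, +0.0 gives +0.0
fsub_unfolded = x - (-zero)  # fsub x, -0.0 gives +0.0
fneg_unfolded = zero - zero  # fsub +0.0, x with x = +0.0 gives +0.0
fneg_folded = -zero          # fneg x with x = +0.0 gives -0.0

# Users that ignore or overwrite the sign bit cannot see the difference.
same_under_compare = (x == fadd_unfolded)
same_under_fabs = sign_bit(abs(x)) == sign_bit(abs(fadd_unfolded))
```

Folding to `x` keeps the sign bit set while the unfolded result clears it, yet a comparison or an `fabs` user observes the same value either way.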
This PR implements parts of
https://github.com/llvm/llvm-project/issues/162376
- **Broader equivalence than constant index deltas**:
- Add Base-delta and Stride-delta matching for Add and GEP forms using
ScalarEvolution deltas.
- Reuse enabled for both constant and variable deltas when an available
IR value dominates the user.
- **Dominance-aware dictionary instead of linear scans**:
- Tuple-keyed candidate dictionary grouped by basic block.
- Walk the immediate-dominator chain to find the nearest dominating
basis quickly and deterministically.
- **Simple cost model and best-rewrite selection**:
- Score candidate expressions and rewrites; select the highest-profit
rewrite per instruction.
- Skip rewriting when expressions are already foldable or
high-efficiency.
- **Path compression for better ILP**:
- Compress chains of rewrites to a deeper dominating basis when a
constant delta exists along the path, reducing dependent bumps on
critical paths.
- **Dependency-aware rewrite ordering**:
- Build a dependency graph (basis, stride, variable delta producers) and
rewrite in topological order.
- This dependency graph will be needed by the next PR that adds partial
strength reduction.
- **Correctness enhancement**:
- Fix a correctness issue where reusing instructions with the same SCEV
may introduce poison.
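
The dominance-aware lookup from the list above can be sketched as follows. This is a simplified model with hypothetical names: `candidates` maps a tuple key to per-block candidate lists, `idom` maps each block to its immediate dominator, and candidates are assumed to be recorded in program order so that anything found in the user's own block precedes it.

```python
def nearest_dominating_basis(candidates, idom, key, block):
    """Walk the immediate-dominator chain to the nearest block with a candidate.

    candidates: {key: {block: [candidate, ...]}}
    idom: {block: immediate dominator, or None for the entry block}
    """
    by_block = candidates.get(key, {})
    while block is not None:
        if block in by_block:
            return by_block[block][-1]  # latest candidate in that block
        block = idom.get(block)
    return None


idom = {"entry": None, "loop": "entry", "then": "loop", "else": "loop"}
candidates = {("base", "stride"): {"entry": ["c0"], "then": ["c1"]}}

from_else = nearest_dominating_basis(candidates, idom, ("base", "stride"), "else")
from_then = nearest_dominating_basis(candidates, idom, ("base", "stride"), "then")
missing = nearest_dominating_basis(candidates, idom, ("other", "key"), "else")
```

The lookup from `else` skips `c1` because `then` is not on its dominator chain, and falls back to `c0` in `entry`.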
---------
Co-authored-by: Kazu Hirata <kazu@google.com>
Port AMDGPUArgumentUsageInfo analysis to the NPM to fix suboptimal code
generation when NPM is enabled by default.
Previously, DAG.getPass() returned nullptr when using NPM, causing the
argument usage info to be unavailable during ISel. This resulted in
fallback to FixedABIFunctionInfo which assumes all implicit arguments
are needed, generating unnecessary register setup code for entry
functions.
Fixes LLVM::CodeGen/AMDGPU/cc-entry.ll
Changes:
- Split AMDGPUArgumentUsageInfo into a data class and NPM analysis
wrapper
- Update SIISelLowering to use DAG.getMFAM() for NPM path
- Add RequireAnalysisPass in addPreISel() to ensure analysis
availability
This follows the same pattern used for PhysicalRegisterUsageInfo.