AsmPrinter needs to be split into three passes (begin, per MF, end) to
avoid the need to materialize all machine functions at the same time.
Update the CodeGenPassBuilder hooks for this.
Reviewers: aeubanks, paperchalice, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/182795
Otherwise we cannot create an MCStreamer without getting MMI, which we
cannot do until we have started running AsmPrinter without also plumbing
MMI through CodeGenPassBuilder.
Reviewers: arsenm, paperchalice, aeubanks
Pull Request: https://github.com/llvm/llvm-project/pull/182794
`AMDGPUArgumentUsageInfo` provided a per-function map that
`lowerFormalArguments` would write each function's implicit argument
register layout into, and `passSpecialInputs` would read back when
lowering calls to look up the callee's layout. This per-function map is
redundant for all non-entry callees, which already use the same
`FixedABIFunctionInfo` register layout.
GlobalISel already used `FixedABIFunctionInfo` unconditionally. This
change makes SelectionDAG do the same.
Improve waitcnt merging in ML kernel loops by increasing latencies on
VALU writes to SGPRs.
Specifically this helps with the case of V_CMP output feeding V_CNDMASK
instructions.
This implements the major piece of
https://discourse.llvm.org/t/rfc-codegen-new-pass-manager-pipeline-construction-design/84659,
making it explicit when we break the function pipeline up.
We essentially get rid of the AddPass and AddMachinePass helpers and
replace them with explicit functions for the pass types. The user then
needs to explicitly call flushFPMstoMPM before breaking.
This is sort of a hybrid of the current construction and what the RFC
proposed. The alternative would be passing around FunctionPassManagers
and having the pipeline actually explicitly constructed. I think this
compromises ergonomics slightly (needing to pass a FPM in many more
places). It is also nice to assert that the function pass manager is
empty when adding a module pass, which is easier when CodegenPassBuilder
owns the FPM and MFPM.
Port AMDGPUArgumentUsageInfo analysis to the NPM to fix suboptimal code
generation when NPM is enabled by default.
Previously, DAG.getPass() returns nullptr when using NPM, causing the
argument usage info to be unavailable during ISel. This resulted in
fallback to FixedABIFunctionInfo which assumes all implicit arguments
are needed, generating unnecessary register setup code for entry
functions.
Fixes LLVM::CodeGen/AMDGPU/cc-entry.ll
Changes:
- Split AMDGPUArgumentUsageInfo into a data class and NPM analysis
wrapper
- Update SIISelLowering to use DAG.getMFAM() for NPM path
- Add RequireAnalysisPass in addPreISel() to ensure analysis
availability
This follows the same pattern used for PhysicalRegisterUsageInfo.
This hasn't been strictly necessary since c897c13dde.
Practically this makes little difference; we still enable IPRA
by default which implies this option. By removing this explicit
force, -enable-ipra=0 has the expected change in the pass pipeline
to remove the DummyCGSCC runs.
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated
Together they enable reconstructing correctly MIR with preload kernarg
SGPRs.
Do not add latency for wavefront and singlethread scope fences during
barrier latency DAG mutation.
These scopes do not typically introduce any latency and adjusting
schedules based on them significantly impacts latency hiding.
This PR introduces `amdgpu-lower-exec-sync` pass which specifically
lowers named-barrier LDS globals introduced by #114550 .
Changes include:
- Moving the logic of lowering named-barrier LDS globals from
`amdgpu-lower-module-lds` pass to this new pass.
- This PR adds the pass to pipeline, remove the existing lowering logic for
named-barrier LDS in `amdgpu-lower-module-lds`
See #161827 for discussion on this topic.
Update all uses of variadic `.Cases` to use the initializer list
overload instead. I plan to mark variadic `.Cases` as deprecated in a
followup PR.
For more context, see https://github.com/llvm/llvm-project/pull/163117.
This PR enables AMDGPUUniformIntrinsicCombine pass in the llc pipeline.
Also introduces the "amdgpu-uniform-intrinsic-combine" command-line flag
to enable/disable the pass.
see the PR:https://github.com/llvm/llvm-project/pull/116953
There has been an issue(using function analysis inside the module pass
in OPM) integrating this pass into the LLC pipeline, which currently
lacks NPM support. I tried finding a way to get the per-function
analysis, but it seems that in OPM, we don't have that option.
So the best approach would be to make it a function pass.
Ref: https://github.com/llvm/llvm-project/pull/116953
Add scheduler DAG mutation to add data dependencies between atomic
fences and preceding memory reads. This allows some modelling of the
impact an atomic fence can have on outstanding memory accesses.
This is beneficial when a fence would cause wait count insertion, as
more instructions will be scheduled before the fence hiding memory
latency.
This pass introduces optimizations for AMDGPU intrinsics by leveraging
the uniformity of their arguments. When an intrinsic's arguments are
detected as uniform, redundant computations are eliminated, and the
intrinsic calls are simplified accordingly.
By utilizing the UniformityInfo analysis, this pass identifies cases
where intrinsic calls are uniform across all lanes, allowing
transformations that reduce unnecessary operations and improve the IR's
efficiency.
These changes enhance performance by streamlining intrinsic usage in
uniform scenarios without altering the program's semantics.
For background, see PR #99878
This is a follow-up for #162207, where I neglected to skip the second
use of AMDGPUAttributor for R600 targets. This use is covered by the
test lld/test/ELF/lto/r600.ll.
These passes call `getSubtarget<GCNSubtarget>`, which doesn't work on
R600 targets, as that uses an `R600Subtarget` type, instead.
Unfortunately, `TargetMachine::getSubtarget<ST>` does an unchecked
static_cast to `ST&`, which makes it easy for this error to go
undetected.
The modifications here were verified by running check-llvm with an
assert added to getSubtarget. However, that asssert requires that RTTI
is enabled, which LLVM doesn't use, so I've reverted the assert before
sending this fix upstream.
These errors have been present for some time, but were detected after
#162040 caused an uninitialized memory read to be reported by asan/msan.
Clang and other frontends generally need the LLVM data layout string in
order to generate LLVM IR modules for LLVM. MLIR clients often need it
as well, since MLIR users often lower to LLVM IR.
Before this change, the LLVM datalayout string was computed in the
LLVM${TGT}CodeGen library in the relevant TargetMachine subclass.
However, none of the logic for computing the data layout string requires
any details of code generation. Clients who want to avoid duplicating
this information were forced to link in LLVMCodeGen and all registered
targets, leading to bloated binaries. This happened in PR #145899,
which measurably increased binary size for some of our users.
By moving this information to the TargetParser library, we
can delete the duplicate datalayout strings in Clang, and retain the
ability to generate IR for unregistered targets.
This is intended to be a very mechanical LLVM-only change, but there is
an immediately obvious follow-up to clang, which will be prepared
separately.
The vast majority of data layouts are computable with two inputs: the
triple and the "ABI name". There is only one exception, NVPTX, which has
a cl::opt to enable short device pointers. I invented a "shortptr" ABI
name to pass this option through the target independent interface.
Everything else fits. Mips is a bit awkward because it uses a special
MipsABIInfo abstraction, which includes members with codegen-like
concepts like ABI physical registers that can't live in TargetParser. I
think the string logic of looking for "n32" "n64" etc is reasonable to
duplicate. We have plenty of other minor duplication to preserve
layering.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: Sergei Barannikov <barannikov88@gmail.com>
Let's do the lowering of non-split into split barriers in a new IR pass,
AMDGPULowerIntrinsics. That way, there is no code duplication between
SelectionDAG and GlobalISel. This simplifies some upcoming extensions to
the code.
When compiling in `--hipstdpar` mode, the builtins corresponding to the
standard library might end up in code that is expected to execute on the
accelerator (e.g. by using the `std::` prefixed functions from
`<cmath>`). We do not have uniform handling for this in AMDGPU, and the
errors that obtain are quite arcane. Furthermore, the user-space changes
required to work around this tend to be rather intrusive.
This patch adds an additional `--hipstdpar` specific pass which forwards
to the run time component of HIPSTDPAR the intrinsics / libcalls which
result from the use of the math builtins, and which are not properly
handled. In the long run we will want to stop relying on this and handle
things in the compiler, but it is going to be a rather lengthy journey,
which makes this medium term escape hatch necessary.
The paired change in the run time component is here
<https://github.com/ROCm/rocThrust/pull/551>.
If we have a v_mov_b32 or v_accvgpr_write_b32 with an inline immediate,
replace it with a pseudo which writes to the combined AV_* class. This
relaxes the operand constraints, which will allow the allocator to
inflate the register class to AV_* to potentially avoid spilling.
The allocator does not know how to replace an instruction to enable
the change of register class. I originally tried to do this by changing
all of the places we introduce v_mov_b32 with immediate, but it's along
tail of niche cases that require manual updating. Plus we can restrict
this to only run on functions where we know we will be allocating AGPRs.
"Required" passes relate to actually running the pass on the IR,
regardless of whether they are in the pipeline.
CGPassBuilder was mistakenly still adding them to the pipeline.
The test `llc -stop-after=greedy -enable-new-pm` would still add
`greedy` to the pipeline otherwise.
In gfx90a-gfx950, it's possible to emit MFMAs which use AGPRs or VGPRs
for vdst and src2. We do not want to do use the AGPR form, unless
required by register pressure as it requires cross bank register
copies from most other instructions. Currently we select the AGPR
or VGPR version depending on a crude heuristic for whether it's possible
AGPRs will be required. We really need the register allocation to
be complete to make a good decision, which is what this pass is for.
This adds the pass, but does not yet remove the selection patterns
for AGPRs. This is a WIP, and NFC-ish. It should be a no-op on any
currently selected code. It also does not yet trigger on the real
examples of interest, which require handling batches of MFMAs at
once.