646 Commits

Author SHA1 Message Date
Akshat Oke
e91cbd4f29
[CodeGen][NPM] Port VirtRegRewriter to NPM (#130564) 2025-04-30 14:10:46 +05:30
Sergei Barannikov
bb1765179e
[TTI] Simplify implementation (NFCI) (#136674)
Replace "concept based polymorphism" with simpler PImpl idiom.

This pursues two goals:
* Enforce static type checking. Previously, target implementations hid
base class methods and type checking was impossible. Now that they
override the methods, the compiler will complain on mismatched
signatures.
* Make the code easier to navigate. Previously, if you asked your
favorite LSP server to show a method (e.g. `getInstructionCost()`), it
would show you methods from `TTI`, `TTI::Concept`, `TTI::Model`,
`TTIImplBase`, and target overrides. Now it is two less :)

There are three commits to hopefully simplify the review.

The first commit removes `TTI::Model`. This is done by deriving
`TargetTransformInfoImplBase` from `TTI::Concept`. This is possible
because they implement the same set of interfaces with identical
signatures.

The first commit makes `TargetTransformImplBase` polymorphic, which
means all derived classes should `override` its methods. This is done in
second commit to make the first one smaller. It appeared infeasible to
extract this into a separate PR because the first commit landed
separately would result in tons of `-Woverloaded-virtual` warnings (and
break `-Werror` builds).

The third commit eliminates `TTI::Concept` by merging it with the only
derived class `TargetTransformImplBase`. This commit could be extracted
into a separate PR, but it touches the same lines in
`TargetTransformInfoImpl.h` (removes `override` added by the second
commit and adds `virtual`), so I thought it may make sense to land these
two commits together.

Pull Request: https://github.com/llvm/llvm-project/pull/136674
2025-04-26 15:25:40 +03:00
Christudasan Devadasan
940108b24d
[AMDGPU][NewPM] Make the pass flow consistent with the legacy pipeline. (#136551) 2025-04-21 15:34:15 +05:30
Akshat Oke
84082223c8
[AMDGPU][NPM] Cleanup AMDGPUPassRegistry.def (#130071)
Finishing up AMDGPU specific passes. Only ones remaining are assembly
printer, virt reg rewriter and PEI.
2025-04-17 10:08:22 +05:30
Jun Wang
31f39c8325
[AMDGPU] Remove the AnnotateKernelFeatures pass (#130198)
Previously the AnnotateKernelFeatures pass infers two attributes:
amdgpu-calls and amdgpu-stack-objects, which are used to help determine
if flat scratch init is allowed. PR #118907 created the
amdgpu-no-flat-scratch-init attribute. Continuing with that work, this
patch makes use of this attribute to determine flat scratch init,
replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the
removal of the AnnotateKernelFeatures pass.
2025-04-15 15:17:33 -07:00
Alex Voicu
1bcec036e1
[HIP][HIPSTDPAR][NFC] Re-order & adapt hipstdpar specific passes (#134753)
The `hipstdpar` specific passes were not ordered ideally, especially for
`fgpu-rdc` compilations, which meant that we'd eagerly run accelerator
code selection and remove symbols that might end up used. This change
corrects that aspect by ensuring that accelerator code selection is only
done after linking (this will have to be revisited in the future once
the closed-world assumption no longer holds). Furthermore, we take the
opportunity to move allocation interposition so that it properly gets
printed when print-pipeline-passes is requested. NFC.
2025-04-15 00:47:09 +03:00
Akshat Oke
b283ff7eb1
[CodeGen][NPM] Port BranchRelaxation to NPM (#130067)
This completes the PreEmitPasses.
2025-04-14 10:19:42 +05:30
Jeffrey Byrnes
6e7fe85247
[AMDGPU] Teach iterative schedulers about IGLP (#134953)
This adds IGLP mutation to the iterative schedulers
(`gcn-iterative-max-occupancy-experimental`, `gcn-iterative-minreg`, and
`gcn-iterative-ilp`).

The `gcn-iterative-minreg` and `gcn-iterative-ilp` schedulers never
actually applied the mutations added, so this also has the effect of
teaching them about mutations in general. The
`gcn-iterative-max-occupancy-experimental` scheduler has calls to
`ScheduleDAGMILive::schedule()`, so, before this, mutations were applied
at this point. Now this is done during calls to `BuildDAG`, with IGLP
superseding other mutations (similar to the other schedulers). We may
end up scheduling regions multiple times, with mutations being applied
each time, so we need to track for
`AMDGPU::SchedulingPhase::PreRAReentry`
2025-04-11 15:34:49 -07:00
Jeffrey Byrnes
5de3118c67
[AMDGPU] Make the iterative schedulers selectable via amdgpu-sched-strategy (#135042)
Currently, the only way for users to try these schedulers is via
`-misched=` . However, this overrides the default scheduler for all
targets. This causes problems for various toolchains / drivers which
spawn jobs for both x86 and AMDGPU -- e.g. hipcc. On the other hand,
`amdgpu-sched-strategy` only changes the scheduler for AMDGPU target.
2025-04-10 14:43:42 -07:00
Akshat Oke
2f6b06b264
[CodeGen][NPM] Port PostRAHazardRecognizer to NPM (#130066) 2025-04-09 16:36:22 +05:30
Akshat Oke
fcaefc2c19
[AMDGPU][NPM] Port SIPreEmitPeephole to NPM (#130065) 2025-04-08 17:58:48 +05:30
Rahul Joshi
a3754ade63
[NFC][LLVM][AMDGPU] Cleanup pass initialization for AMDGPU (#134410)
- Remove calls to pass initialization from pass constructors.
- https://github.com/llvm/llvm-project/issues/111767
2025-04-07 17:27:50 -07:00
Akshat Oke
a13a51b91f
[AMDGPU][NPM] Port AMDGPUSetWavePriority to NPM (#130064) 2025-04-02 16:28:05 +05:30
Akshat Oke
719b029c16
[AMDGPU][NPM] Port SILateBranchLowering to NPM (#130063) 2025-03-26 19:28:19 +05:30
Akshat Oke
f8e908a0ed
[AMDGPU][NPM] Port SIInsertHardClauses to NPM (#130062) 2025-03-25 15:33:32 +05:30
Akshat Oke
f10dc76f03
[AMDGPU][NPM] Port SIInsertWaitcnts to NPM (#130061) 2025-03-24 21:36:45 +05:30
Akshat Oke
6cc23faaac
[AMDGPU][NPM] Port AMDGPUMarkLastScratchLoad to NPM (#131738)
This finishes all passes for the optimized regalloc path.

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-03-19 09:27:05 +05:30
Shilei Tian
51c706c119
[NFC][AMDGPU] Replace direct arch comparison with isAMDGCN() (#131357) 2025-03-14 14:21:44 -04:00
Akshat Oke
f34385dd1b
[AMDGPU][NPM] Port GCNCreateVOPD to NPM (#130059) 2025-03-14 10:22:45 +05:30
Petar Avramovic
3ad810ea9a
AMDGPU/GlobalISel: Disable LCSSA pass (#124297)
Disable LCSSA pass in preparation for implementing temporal divergence
lowering in amdgpu divergence lowering. Breaks all cases where sgpr or
i1 values are used outside of the cycle with divergent exit.
Regenerate regression tests for amdgpu divergence lowering with LCSSA
disabled.
Update IntrinsicLaneMaskAnalyzer to stop tracking lcssa phis that are
lane masks.
2025-03-12 11:09:50 +01:00
Akshat Oke
c22c5643db
[AMDGPU][NPM] Port SIMemoryLegalizer to NPM (#130060) 2025-03-12 14:30:35 +05:30
Matt Arsenault
0d2c55cb96
AMDGPU: Move enqueued block handling into clang (#128519)
The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.

This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.

I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.

We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.

https://reviews.llvm.org/D141700
2025-03-10 19:54:04 +07:00
Akshat Oke
52225d2702
[AMDGPU][NewPM] Port AMDGPUReserveWWMRegs to NPM (#123722) 2025-03-10 17:36:35 +05:30
Akshat Oke
6c87ec4f4d
[AMDGPU][NPM] Port SIModeRegister to NPM (#129014) 2025-03-04 10:51:03 +05:30
Akshat Oke
852923822f
[AMDGPU][NewPM] Port AMDGPUInsertDelayAlu to NPM (#128003) 2025-02-26 09:50:09 +05:30
Akshat Oke
bd16a87d05
[AMDGPU][NewPM] Port SIPostRABundler to NPM (#123717) 2025-02-21 16:05:58 +05:30
Akshat Oke
9855d761f3
[AMDGPU][NewPM] Port SIOptimizeExecMaskingPreRA to NPM (#125351) 2025-02-20 17:35:56 +05:30
Vikram Hegde
663db5c70d
[AMDGPU][NewPM] Port GCNNSAReassign pass to new pass manager (#125034)
tests to be added while porting virtregrewrite and greedy regalloc
2025-02-18 11:13:31 +05:30
Scott Linder
29ca3b8b28
[AMDGPU] Push amdgpu-preload-kern-arg-prolog after livedebugvalues (#126148)
This is effectively a workaround for a bug in livedebugvalues, but seems
to potentially be a general improvement, as BB sections seems like it
could ruin the special 256-byte prelude scheme that
amdgpu-preload-kern-arg-prolog requires anyway. Moving it even later
doesn't seem to have any material impact, and just adds livedebugvalues
to the list of things which no longer have to deal with pseudo
multiple-entry functions.

AMDGPU debug-info isn't supported upstream yet, so the bug being avoided
isn't testable here. I am posting the patch upstream to avoid an
unnecessary diff with AMD's fork.
2025-02-17 13:29:56 -05:00
Vikram Hegde
06a3abd9e8
[AMDGPU][NewPM] Port "SIFormMemoryClauses" to NPM (#127181) 2025-02-17 11:07:17 +05:30
Akshat Oke
7b60e03d73
Reland "CodeGen][NewPM] Port MachineScheduler to NPM. (#125703)" (#126684)
`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.

Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.

---------

Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
2025-02-12 18:54:39 +05:30
Vikram Hegde
9c725ef368
[AMDGPU][NewPM] Port "GCNRewritePartialRegUses" pass to NPM (#126024) 2025-02-12 11:21:40 +05:30
Vikram Hegde
3293bff5d2
[AMDGPU][NewPM] Port "GCNPreRAOptimizations" pass to NPM (#126040) 2025-02-11 11:09:38 +05:30
Shilei Tian
bde8ce6a5c
[AMDGPU] Only run AMDGPUPrintfRuntimeBindingPass at non-prelink phase (#125162) 2025-02-10 08:24:50 -05:00
Akshat Oke
564b9b7f4d
Revert "CodeGen][NewPM] Port MachineScheduler to NPM. (#125703)" (#126268)
This reverts commit 5aa4979c47255770cac7b557f3e4a980d0131d69 while I
investigate what's causing the compile-time regression.
2025-02-08 15:36:48 +05:30
Christudasan Devadasan
814db6c53f
[CodeGen][NewPM] Port GCNPreRALongBranchReg to NPM. (#125844) 2025-02-05 18:46:27 +05:30
Christudasan Devadasan
b83c960bad
[CodeGen][NewPM] Port SIWholeQuadMode to NPM. (#125833) 2025-02-05 18:44:57 +05:30
Christudasan Devadasan
5aa4979c47
CodeGen][NewPM] Port MachineScheduler to NPM. (#125703) 2025-02-05 12:17:59 +05:30
Christudasan Devadasan
a47c35a699
[CodeGen] Move MISched target hooks into TargetMachine (#125700)
The createSIMachineScheduler & createPostMachineScheduler
target hooks are currently placed in the PassConfig interface.
Moving it out to TargetMachine so that both legacy and
the new pass manager can effectively use them.
2025-02-05 11:41:37 +05:30
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Joel E. Denny
18f8106f31
[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944)
This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
2025-01-29 12:40:19 -05:00
Chaitanya
3c79a04cc2
[AMDGPU] Add amdgpu-sw-lower-lds pass to NPM codegen addIRPasses. (#124102)
This PR adds amdgpu-sw-lower-lds pass to
AMDGPUCodeGenPassBuilder::addIRPasses()
2025-01-24 11:15:30 +05:30
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Matt Arsenault
93d35ad5f5
AMDGPU: Delete FillMFMAShadowMutation (#123861)
No test changes with this removed and it appears to
be obsolete.
2025-01-22 22:41:25 +07:00
Akshat Oke
a343b8e595
[AMDGPU][NewPM] Port SILowerWWMCopies to NPM (#123695) 2025-01-22 14:54:01 +05:30
Akshat Oke
9b6e8df896
[AMDGPU][NewPM] Port SIFixVGPRCopies to NPM (#123592)
Extends NPM pipeline support till PostRegAlloc passes (greedy is in the
works)
2025-01-21 15:27:46 +05:30
Akshat Oke
96c4f978d0
[AMDGPU][NewPM] Port SIOptimizeExecMasking to NPM (#123572) 2025-01-20 16:34:01 +05:30
Christudasan Devadasan
1797fb6b23
[AMDGPU][NewPM] Port SILowerControlFlow pass into NPM. (#123045) 2025-01-16 11:06:38 +05:30
Akshat Oke
73b0e8a191
[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434) 2025-01-13 17:52:30 +05:30
Akshat Oke
f431f93a77
[CodeGen][NewPM] Use proper NPM AtomicExpandPass in AMDGPU (#122086)
`PassRegistry.def` already has this entry, but the dummy definition was
being pulled instead.

I couldn't reproduce the build failures that FIXME referenced, maybe the
Dummy pass getting in the way was part of the cause.
2025-01-13 10:38:24 +05:30