Move canGuaranteeTCO and mayTailCallThisCC into AMDGPUBaseInfo instead
of keeping two copies for DAG/Global ISel.
Also remove isKernelCC, which doesn't agree with isKernel and doesn't
seem very useful.
While at it, also move all the CC-related helpers into AMDGPUBaseInfo.h and
mark them constexpr.
We rename `createGenericSchedLive` and `createGenericSchedPostRA`
to `createSchedLive` and `createSchedPostRA`, and add a template
parameter `Strategy` which is the generic implementation by default.
This can simplify some code for targets that have custom scheduler
strategy.
It will run twice in the non-LTO pipeline with `O1` or higher. In LTO post link pipeline, it will be run once with `O2` or higher, since inline and SROA don't run in `O1`.
Of the 128-bits of buffer descriptor only 48 bits are address bits, so
following the discussion on https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54,
the logic conclusion is to set the index width to 48 bits instead of
the current value of 128.
Most of the test changes are mechanical datalayout updates, but there
is one actual change: the ptrmask test now uses .i48 instead of .i128
and I had to update SelectionDAGBuilder to correctly extend the mask.
Reviewed By: krzysz00
Pull Request: https://github.com/llvm/llvm-project/pull/139419
Moves kernarg preload logic to its own module pass. Cloned function
declarations are removed when preloading hidden arguments. The inreg
attribute is now added in this pass instead of AMDGPUAttributor. The
rest of the logic is copied from AMDGPULowerKernelArguments which now
only check whether an arguments is marked inreg to avoid replacing
direct uses of preloaded arguments. This change requires test updates to
remove inreg from lit tests with kernels that don't actually want
preloading.
Register assembly printer passes in the pass registry.
This makes it possible to use `llc -start-before=<target>-asm-printer ...` in tests.
Adds a `char &ID` parameter to the AssemblyPrinter constructor to allow
targets to use the `INITIALIZE_PASS` macros and register the pass in the
pass registry. This currently has a default parameter so it won't break
any targets that have not been updated.
Replace "concept based polymorphism" with simpler PImpl idiom.
This pursues two goals:
* Enforce static type checking. Previously, target implementations hid
base class methods and type checking was impossible. Now that they
override the methods, the compiler will complain on mismatched
signatures.
* Make the code easier to navigate. Previously, if you asked your
favorite LSP server to show a method (e.g. `getInstructionCost()`), it
would show you methods from `TTI`, `TTI::Concept`, `TTI::Model`,
`TTIImplBase`, and target overrides. Now it is two less :)
There are three commits to hopefully simplify the review.
The first commit removes `TTI::Model`. This is done by deriving
`TargetTransformInfoImplBase` from `TTI::Concept`. This is possible
because they implement the same set of interfaces with identical
signatures.
The first commit makes `TargetTransformImplBase` polymorphic, which
means all derived classes should `override` its methods. This is done in
second commit to make the first one smaller. It appeared infeasible to
extract this into a separate PR because the first commit landed
separately would result in tons of `-Woverloaded-virtual` warnings (and
break `-Werror` builds).
The third commit eliminates `TTI::Concept` by merging it with the only
derived class `TargetTransformImplBase`. This commit could be extracted
into a separate PR, but it touches the same lines in
`TargetTransformInfoImpl.h` (removes `override` added by the second
commit and adds `virtual`), so I thought it may make sense to land these
two commits together.
Pull Request: https://github.com/llvm/llvm-project/pull/136674
Previously the AnnotateKernelFeatures pass infers two attributes:
amdgpu-calls and amdgpu-stack-objects, which are used to help determine
if flat scratch init is allowed. PR #118907 created the
amdgpu-no-flat-scratch-init attribute. Continuing with that work, this
patch makes use of this attribute to determine flat scratch init,
replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the
removal of the AnnotateKernelFeatures pass.
The `hipstdpar` specific passes were not ordered ideally, especially for
`fgpu-rdc` compilations, which meant that we'd eagerly run accelerator
code selection and remove symbols that might end up used. This change
corrects that aspect by ensuring that accelerator code selection is only
done after linking (this will have to be revisited in the future once
the closed-world assumption no longer holds). Furthermore, we take the
opportunity to move allocation interposition so that it properly gets
printed when print-pipeline-passes is requested. NFC.
This adds IGLP mutation to the iterative schedulers
(`gcn-iterative-max-occupancy-experimental`, `gcn-iterative-minreg`, and
`gcn-iterative-ilp`).
The `gcn-iterative-minreg` and `gcn-iterative-ilp` schedulers never
actually applied the mutations added, so this also has the effect of
teaching them about mutations in general. The
`gcn-iterative-max-occupancy-experimental` scheduler has calls to
`ScheduleDAGMILive::schedule()`, so, before this, mutations were applied
at this point. Now this is done during calls to `BuildDAG`, with IGLP
superseding other mutations (similar to the other schedulers). We may
end up scheduling regions multiple times, with mutations being applied
each time, so we need to track for
`AMDGPU::SchedulingPhase::PreRAReentry`
Currently, the only way for users to try these schedulers is via
`-misched=` . However, this overrides the default scheduler for all
targets. This causes problems for various toolchains / drivers which
spawn jobs for both x86 and AMDGPU -- e.g. hipcc. On the other hand,
`amdgpu-sched-strategy` only changes the scheduler for AMDGPU target.
Disable LCSSA pass in preparation for implementing temporal divergence
lowering in amdgpu divergence lowering. Breaks all cases where sgpr or
i1 values are used outside of the cycle with divergent exit.
Regenerate regression tests for amdgpu divergence lowering with LCSSA
disabled.
Update IntrinsicLaneMaskAnalyzer to stop tracking lcssa phis that are
lane masks.
The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.
This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.
I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.
We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.
https://reviews.llvm.org/D141700
This is effectively a workaround for a bug in livedebugvalues, but seems
to potentially be a general improvement, as BB sections seems like it
could ruin the special 256-byte prelude scheme that
amdgpu-preload-kern-arg-prolog requires anyway. Moving it even later
doesn't seem to have any material impact, and just adds livedebugvalues
to the list of things which no longer have to deal with pseudo
multiple-entry functions.
AMDGPU debug-info isn't supported upstream yet, so the bug being avoided
isn't testable here. I am posting the patch upstream to avoid an
unnecessary diff with AMD's fork.
`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.
Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
---------
Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>