This change introduces a new kernel attribute that allows thread blocks
to be mapped to clusters. It also adds support for the `+ptx90` PTX ISA
feature.
When assigning numbers to registers, skip any with neither uses nor
defs. This will not have any impact on the final SASS, but it
makes for slightly more readable PTX. This change should also ensure
that future minor changes are less likely to cause noisy diffs in
register numbering.
If alignment inference happens after NVPTXLowerArgs, these addrspace wrap
intrinsics can prevent computeKnownBits from deriving alignment of
loads/stores from parameters. To solve this, we can insert an alignment
annotation on the generated intrinsic so that computeKnownBits does not
need to traverse through it to find the alignment.
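A minimal sketch of the idea, assuming a wrap intrinsic shaped like the
one described above (the exact intrinsic name and overload are
assumptions, not the verified upstream spelling):

```llvm
; Hypothetical wrap intrinsic; the real name/overload may differ.
declare ptr addrspace(101) @llvm.nvvm.internal.addrspace.wrap.p101.p0(ptr)

define void @kernel(ptr byval(i64) align 8 %in) {
  ; The align return attribute on the call site lets computeKnownBits
  ; see the parameter's alignment without traversing the intrinsic.
  %w = call align 8 ptr addrspace(101) @llvm.nvvm.internal.addrspace.wrap.p101.p0(ptr %in)
  %v = load i64, ptr addrspace(101) %w ; alignment now derivable as 8
  ret void
}
```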
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
[NVPTX] Add Prefetch tensormap intrinsics
This PR adds prefetch intrinsics with the relevant tensormap_space.
* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.
For more information, refer to the PTX ISA documentation for the prefetch instruction:
[Prefetch
Tensormap](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu)
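A sketch of the intended usage (the intrinsic name and overload below
are assumptions; see prefetch.ll for the authoritative tests):

```llvm
; Hypothetical spelling of the tensormap prefetch intrinsic.
declare void @llvm.nvvm.prefetch.tensormap.p0(ptr)

define void @prefetch_tmap(ptr %tmap) {
  call void @llvm.nvvm.prefetch.tensormap.p0(ptr %tmap)
  ; expected PTX (sketch): prefetch.tensormap [%rd1];
  ret void
}
```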
Add support for 3-input fmaxnum/fminnum/fmaximum/fminimum introduced in
PTX 8.8 for sm_100+:
- Use a tree reduction when 3-input operations are supported and the
reduction has the `reassoc` flag.
- If not on sm_100+/PTX 8.8, fall back to 2-input operations and use the
default shuffle reduction (see the sketch below).
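For illustration, a reduction that qualifies for the tree lowering (a
sketch; the 3-input PTX form in the comment follows the PTX 8.8
description above):

```llvm
declare float @llvm.vector.reduce.fmax.v8f32(<8 x float>)

define float @reduce_fmax(<8 x float> %v) {
  ; reassoc permits reassociating the reduction into a tree of
  ; 3-input max.f32 operations on sm_100+/PTX 8.8.
  %r = call reassoc float @llvm.vector.reduce.fmax.v8f32(<8 x float> %v)
  ret float %r
}
; sm_100+ (sketch): max.f32 %f, %a, %b, %c; ... as a tree
; otherwise: pairwise max.f32 via the default shuffle reduction
```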
This change adds support for folding a SETCC when one or both of the
operands is a TRUNCATE with the appropriate no-wrap flags. This pattern
can occur when promoting i8 operations in NVPTX, and we currently have
some ISel rules that try to handle it.
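The equivalent IR-level pattern, as a sketch:

```llvm
define i1 @cmp_after_trunc(i32 %a, i32 %b) {
  ; nuw guarantees truncation drops no set bits, so the i8 compare
  ; can fold to an i32 compare of %a and %b directly.
  %ta = trunc nuw i32 %a to i8
  %tb = trunc nuw i32 %b to i8
  %c = icmp eq i8 %ta, %tb
  ret i1 %c
}
```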
This change rewrites LowerCall handling of byval arguments to vectorize
the loads as well as the stores. Various minor NFC updates and cleanups
are also made to reduce code duplication.
Fix a correctness bug in ISel lowering patterns for setcc of v4i8
extraction. Refactor and clean up these patterns somewhat in general to
try to make them a bit more comprehensible.
In order to prevent the st.param and ld.param instructions which store
parameters and load return values from being sunk or hoisted out of a
call sequence, mark the callseq start and end nodes as reading and
writing memory.
Fixes #151329
Implements `(sign_extend|zero_extend (mul|shl) x, y) -> (mul.wide x, y)`
as a DAG combine.
Implements `(add (mul.wide a, b), c) -> (mad.wide a, b, c)` in
instruction selection.
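For example (sketch):

```llvm
define i32 @mad_wide(i16 %a, i16 %b, i32 %c) {
  ; nsw makes the sign-extension of the product equal the widened
  ; product, matching mul.wide.s16; the add then folds to mad.wide.s16.
  %m = mul nsw i16 %a, %b
  %e = sext i16 %m to i32
  %r = add i32 %e, %c
  ret i32 %r
}
; expected PTX (sketch): mad.wide.s16 %r1, %rs1, %rs2, %r2;
```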
This removes the only virtual function of MCSection.
NVPTXTargetStreamer::changeSection uses the MCSectionELF print method.
Change it to just print the section name.
Currently __nv_fast_tanhf() in libdevice maps to an nvvm intrinsic that
has not been upstreamed, which is causing issues when using the NVPTX
backend from upstream. Rather than upstreaming the intrinsic, we can
use the existing Intrinsic::tanh with the afn flag. This change
adds NVPTX backend support for ISD::TANH, adds an auto-upgrade of the old
tanh_approx intrinsic to @llvm.tanh.f32 with the afn flag so that libdevice
works properly upstream, and adds a basic codegen test and a case to the
auto-upgrade test.
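A codegen sketch of the new path:

```llvm
declare float @llvm.tanh.f32(float)

define float @fast_tanh(float %x) {
  ; afn permits the approximate lowering on targets that support it.
  %r = call afn float @llvm.tanh.f32(float %x)
  ret float %r
}
; expected PTX (sketch): tanh.approx.f32 %f2, %f1;
```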
This patch adds support for the im2col-w/w128 and scatter/gather modes
for TMA Copy and Prefetch intrinsics, completing support for all the
available modes. These are lowered through tablegen, building
on top of earlier patches.
* lit tests are added for all the combinations and verified with a
12.8 ptxas executable.
* Documentation is updated in the NVPTXUsage.rst file.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
We may still need to keep CopyToReg even after folding uses into vector
loads, since the original register may be used in other blocks.
Partially reverts 1fdbe6984976d9e85ab3b1a93e8de434a85c5646
This MR adds support for cmpxchg instructions with syncscope (a sketch
follows the list below).
- Adds a new definition for atomic 3-operand instructions, with constant
operands for sem, scope and addsp.
- Lowers cmpxchg SDNodes populating sem, scope and addsp using
SDNodeXForms.
- Handles syncscope correctly for emulation loops in AtomicExpand, in
bracketInstructionWithFences.
- Modifies emitLeadingFence, emitTrailingFence to accept SyncScope as a
parameter. Modifies implementation of these in other backends, with the
parameter being ignored.
- Tests for a _slice_ of all possible combinations of the cmpxchg
instruction (with modifications to cmpxchg.py)
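For example, a scoped cmpxchg now looks like this (a sketch; the scope
string "block" mapping to the PTX .cta scope is an assumption about the
naming this change uses):

```llvm
define i32 @cas_cta(ptr %p, i32 %old, i32 %new) {
  ; syncscope("block") is assumed here to map to the PTX .cta scope.
  %pair = cmpxchg ptr %p, i32 %old, i32 %new syncscope("block") acquire monotonic
  %v = extractvalue { i32, i1 } %pair, 0
  ret i32 %v
}
```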
---------
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Replace uses of BFE with PRMT when lowering v4i8 vectors. This will
generally lead to equivalent or better SASS and reduce the number of
target-specific operations we need to represent
(https://cuda.godbolt.org/z/M75W6f8xd). Also implement KnownBits tracking
for PRMT, allowing elimination of redundant AND instructions when
lowering various i8 operations.
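As an illustration (a sketch; the selector constant in the comment is an
assumption about prmt's byte-select encoding). Because each result byte
of prmt comes from a known source byte, KnownBits can often prove the
upper bits and drop the masking ANDs:

```llvm
define i8 @extract_byte(<4 x i8> %v) {
  ; A single prmt can pick byte 2 out of the packed 32-bit value,
  ; where BFE or shift+and sequences were used before.
  %e = extractelement <4 x i8> %v, i32 2
  ret i8 %e
}
; expected PTX (sketch): prmt.b32 %r2, %r1, 0, 0x2;
```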
Lower `fadd`, `fsub`, `fmul`, and `fma` to f32x2 variants introduced in
PTX 8.6 for sm_100+. Adds a new register class for v2f32 as a b64
register in PTX. This causes other vector operations like loads and
stores to lower as .b64 instead of .v2.b32 as appropriate.
Also update test cases to use the autogenerator.
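A sketch of the new lowering (the exact f32x2 mnemonic is inferred from
the description above):

```llvm
define <2 x float> @vadd(<2 x float> %a, <2 x float> %b) {
  ; On sm_100+/PTX 8.6 the whole v2f32 stays in one b64 register.
  %r = fadd <2 x float> %a, %b
  ret <2 x float> %r
}
; expected PTX (sketch): add.rn.f32x2 %rd3, %rd1, %rd2;
```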
This patch moves the lowering of the TMA Tensor prefetch
and S2G-copy intrinsics to tablegen itself. This is in preparation
for adding Blackwell-specific additions to these intrinsics.
The TMA reduction intrinsics lowering is kept intact (C++), and
hence the macro names are updated to reflect the current usage.
The existing tests have full coverage and continue to pass as expected.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This change cleans up DAG-to-DAG instruction selection around FTZ and
the SETP comparison mode. Largely, these changes do not impact
functionality, though support for `{sin,cos}.approx.ftz.f32` is added.
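For instance, the newly supported form (a sketch; the exact conditions
under which the FTZ variant is selected are not spelled out here):

```llvm
declare float @llvm.sin.f32(float)

define float @fast_sin(float %x) {
  ; afn permits the approximate lowering.
  %r = call afn float @llvm.sin.f32(float %x)
  ret float %r
}
; with FTZ enabled (sketch): sin.approx.ftz.f32 %f2, %f1;
```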
This change continues rewriting and cleanup around DAG ISel for
formal-arguments, return values, and function calls. This causes some
incidental changes, mostly to instruction ordering and register naming
but also a couple of improvements caused by using scalar types earlier in
the lowering.
Remove `UnsafeFPMath` in `visitFMULForFMADistributiveCombine`,
`visitFSUBForFMACombine` and `visitFDIV`.
All affected tests are fixed by adding fast math flags manually.
Propagate fast math flags when lowering fdiv in the NVPTX backend so it
can produce an optimized DAG when `unsafe-fp-math` is absent.
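A sketch of the flag-driven lowering:

```llvm
define float @fast_div(float %a, float %b) {
  ; Instruction-level flags now drive the choice, instead of the
  ; module-wide unsafe-fp-math attribute.
  %r = fdiv afn arcp float %a, %b
  ret float %r
}
; expected PTX (sketch): div.approx.f32 %f3, %f1, %f2;
```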
This change fixes v2i8 lowering for parameters and returned values. As
part of this work, I move the lowering for return values to use generic
ISD::STORE nodes as these are more flexible and have existing
legalization handling.
Note that calling a function with v2i8 arguments or returns is still not
working but this is left for a subsequent change as this MR is already
fairly large.
Partially addresses #128853
Currently, the NVPTXPrologEpilogPass will crash if LIFETIME_START or
LIFETIME_END instructions are encountered. Usually this isn't a problem
since a couple of earlier passes will always remove them. However, when
using opt-bisect-limit, crashes can occur. This can hinder debugging and
reveals a potential future problem if these optimization passes change
their behavior. https://cuda.godbolt.org/z/E81xxKGdb
This change updates NVPTXPrologEpilogPass and
NVPTXRegisterInfo::eliminateFrameIndex to gracefully handle these
instructions by simply removing them. While I'm here I also did some
general fixup in NVPTXPrologEpilogPass to make it look more like
PrologEpilogInserter (from which it was copied).
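A sketch of the kind of IR involved; when the earlier cleanup passes are
bisected away, these markers now simply get dropped instead of crashing
the pass:

```llvm
declare void @llvm.lifetime.start.p0(i64, ptr)
declare void @llvm.lifetime.end.p0(i64, ptr)

define void @f() {
  %a = alloca i32
  call void @llvm.lifetime.start.p0(i64 4, ptr %a)
  store i32 0, ptr %a
  call void @llvm.lifetime.end.p0(i64 4, ptr %a)
  ret void
}
```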
Allow directly storing an immediate instead of requiring that it first
be moved into a register. This makes for more compact and readable PTX.
A similar approach (using a ComplexPattern) could be used for most PTX
instructions to avoid the need for `_[ri]+` variants and boilerplate.
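For example (sketch):

```llvm
define void @store_imm(ptr %p) {
  store i32 42, ptr %p
  ret void
}
; before: mov.b32 %r1, 42;  st.b32 [%rd1], %r1;
; after (sketch): st.b32 [%rd1], 42;
```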
This change consolidates and cleans up various NVPTXISD target-specific
nodes in order to simplify SDAG ISel. While there are some whitespace
changes in the emitted PTX, it is otherwise a non-functional change.
NVPTXISD::Wrapper - This node was used to wrap external-symbol and
global-address nodes. It is redundant and has been removed. Instead we
use the non-target versions of these nodes and convert them
appropriately during ISel.
NVPTXISD::CALL - Much of the family of nodes used to represent a PTX
call instruction has been replaced by this new single node. It
corresponds to a single instruction and is therefore much simpler to
create and lower.
This change adds support for the family-specific architecture variants
introduced in [PTX ISA
8.8](https://docs.nvidia.com/cuda/parallel-thread-execution/#ptx-isa-version-8-8).
These architecture variants have an "f" suffix, for example, sm_100f.
This change doesn't promote existing features to the family-specific
architectures.
## Purpose
This patch is one in a series of code-mods that annotate LLVM’s public
interface for export. This patch annotates the `llvm/Target` library.
These annotations currently have no meaningful impact on the LLVM build;
however, they are a prerequisite to support an LLVM Windows DLL (shared
library) build.
## Background
This effort is tracked in #109483. Additional context is provided in
[this
discourse](https://discourse.llvm.org/t/psa-annotating-llvm-public-interface/85307),
and documentation for `LLVM_ABI` and related annotations is found in the
LLVM repo
[here](https://github.com/llvm/llvm-project/blob/main/llvm/docs/InterfaceExportAnnotations.rst).
A subset of these changes was generated automatically using the
[Interface Definition Scanner (IDS)](https://github.com/compnerd/ids)
tool, followed by formatting with `git clang-format`.
The bulk of this change is manual additions of `LLVM_ABI` to
`LLVMInitializeX` functions defined in .cpp files under llvm/lib/Target.
Adding `LLVM_ABI` to the function implementation is required here
because they do not `#include "llvm/Support/TargetSelect.h"`, which
contains the declarations for these functions and was already updated
with `LLVM_ABI` in a previous patch. I considered patching these files
with `#include "llvm/Support/TargetSelect.h"` instead, but since
TargetSelect.h is a large file with a bunch of preprocessor x-macro
stuff in it I was concerned it would unnecessarily impact compile times.
In addition, a number of unit tests under llvm/unittests/Target required
additional dependencies to make them build correctly against the LLVM
DLL on Windows using MSVC.
## Validation
Local builds and tests were used to validate cross-platform compatibility. This
included llvm, clang, and lldb on the following configurations:
- Windows with MSVC
- Windows with Clang
- Linux with GCC
- Linux with Clang
- Darwin with Clang
In the AArch64 version this helps reduce the number of blr instructions
(indirect jumps) from 325 to 87, and reduces the size of the object
file by 4%. It seems to help make the code more efficient even if it
doesn't greatly affect compile time.
The AMDGPU variants are already marked as final.