This change introduces a new kernel attribute that allows thread blocks
to be mapped to clusters. It also adds support for the `+ptx90` PTX ISA
feature.
When assigning numbers to registers, skip any with neither uses nor
defs. This will not have any impact on the final SASS, but it
makes for slightly more readable PTX. This change should also ensure
that future minor changes are less likely to cause noisy diffs in
register numbering.
If alignment inference happens after NVPTXLowerArgs, these addrspace wrap
intrinsics can prevent computeKnownBits from deriving alignment of
loads/stores from parameters. To solve this, we can insert an alignment
annotation on the generated intrinsic so that computeKnownBits does not
need to traverse through it to find the alignment.
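A minimal sketch of the idea, assuming a wrap intrinsic shaped like the
one described above (the exact intrinsic name and overload are
assumptions, not the verified upstream spelling):

```llvm
; Hypothetical wrap intrinsic; the real name/overload may differ.
declare ptr addrspace(101) @llvm.nvvm.internal.addrspace.wrap.p101.p0(ptr)

define void @kernel(ptr byval(i64) align 8 %in) {
  ; The align return attribute on the call site lets computeKnownBits
  ; see the parameter's alignment without traversing the intrinsic.
  %w = call align 8 ptr addrspace(101) @llvm.nvvm.internal.addrspace.wrap.p101.p0(ptr %in)
  %v = load i64, ptr addrspace(101) %w ; alignment now derivable as 8
  ret void
}
```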
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
[NVPTX] Add Prefetch tensormap intrinsics
This PR adds prefetch intrinsics with the relevant tensormap_space.
* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.
For more information, refer to the PTX ISA documentation for the prefetch instruction:
[Prefetch
Tensormap](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu)
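A sketch of the intended usage (the intrinsic name and overload below
are assumptions; see prefetch.ll for the authoritative tests):

```llvm
; Hypothetical spelling of the tensormap prefetch intrinsic.
declare void @llvm.nvvm.prefetch.tensormap.p0(ptr)

define void @prefetch_tmap(ptr %tmap) {
  call void @llvm.nvvm.prefetch.tensormap.p0(ptr %tmap)
  ; expected PTX (sketch): prefetch.tensormap [%rd1];
  ret void
}
```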
Add support for 3-input fmaxnum/fminnum/fmaximum/fminimum introduced in
PTX 8.8 for sm_100+:
- Use a tree reduction when 3-input operations are supported and the
reduction has the `reassoc` flag.
- If not on sm_100+/PTX 8.8, fall back to 2-input operations and use the
default shuffle reduction (see the sketch below).
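For illustration, a reduction that qualifies for the tree lowering (a
sketch; the 3-input PTX form in the comment follows the PTX 8.8
description above):

```llvm
declare float @llvm.vector.reduce.fmax.v8f32(<8 x float>)

define float @reduce_fmax(<8 x float> %v) {
  ; reassoc permits reassociating the reduction into a tree of
  ; 3-input max.f32 operations on sm_100+/PTX 8.8.
  %r = call reassoc float @llvm.vector.reduce.fmax.v8f32(<8 x float> %v)
  ret float %r
}
; sm_100+ (sketch): max.f32 %f, %a, %b, %c; ... as a tree
; otherwise: pairwise max.f32 via the default shuffle reduction
```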
This change adds support for folding a SETCC when one or both of the
operands is a TRUNCATE with the appropriate no-wrap flags. This pattern
can occur when promoting i8 operations in NVPTX, and we currently have
some ISel rules that try to handle it.
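The equivalent IR-level pattern, as a sketch:

```llvm
define i1 @cmp_after_trunc(i32 %a, i32 %b) {
  ; nuw guarantees truncation drops no set bits, so the i8 compare
  ; can fold to an i32 compare of %a and %b directly.
  %ta = trunc nuw i32 %a to i8
  %tb = trunc nuw i32 %b to i8
  %c = icmp eq i8 %ta, %tb
  ret i1 %c
}
```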
This change rewrites LowerCall handling of byval arguments to vectorize
the loads as well as the stores. Various minor NFC updates and cleanups
are also made to reduce code duplication.
Fix a correctness bug in ISel lowering patterns for setcc of v4i8
extraction. Refactor and clean up these patterns somewhat in general to
try to make them a bit more comprehensible.
In order to prevent the st.param and ld.param instructions which store
parameters and load return values from being sunk or hoisted out of a
call sequence, mark the callseq start and end nodes as reading and
writing memory.
Fixes #151329
Implements `(sign_extend|zero_extend (mul|shl) x, y) -> (mul.wide x, y)`
as a DAG combine.
Implements `(add (mul.wide a, b), c) -> (mad.wide a, b, c)` in
instruction selection.
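For example (sketch):

```llvm
define i32 @mad_wide(i16 %a, i16 %b, i32 %c) {
  ; nsw makes the sign-extension of the product equal the widened
  ; product, matching mul.wide.s16; the add then folds to mad.wide.s16.
  %m = mul nsw i16 %a, %b
  %e = sext i16 %m to i32
  %r = add i32 %e, %c
  ret i32 %r
}
; expected PTX (sketch): mad.wide.s16 %r1, %rs1, %rs2, %r2;
```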
This removes the only virtual function of MCSection.
NVPTXTargetStreamer::changeSection uses the MCSectionELF print method.
Change it to just print the section name.
Currently __nv_fast_tanhf() in libdevice maps to an nvvm intrinsic that
has not been upstreamed, which is causing issues when using the NVPTX
backend from upstream. Rather than upstreaming the intrinsic, we can
use the existing Intrinsic::tanh with the afn flag. This change
adds NVPTX backend support for ISD::TANH, adds an auto-upgrade of the old
tanh_approx intrinsic to @llvm.tanh.f32 with the afn flag so that libdevice
works properly upstream, and adds a basic codegen test and a case to the
auto-upgrade test.
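A codegen sketch of the new path:

```llvm
declare float @llvm.tanh.f32(float)

define float @fast_tanh(float %x) {
  ; afn permits the approximate lowering on targets that support it.
  %r = call afn float @llvm.tanh.f32(float %x)
  ret float %r
}
; expected PTX (sketch): tanh.approx.f32 %f2, %f1;
```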
This patch adds support for the im2col-w/w128 and scatter/gather modes
for TMA Copy and Prefetch intrinsics, completing support for all the
available modes. These are lowered through tablegen, building
on top of earlier patches.
* lit tests are added for all the combinations and verified with a
12.8 ptxas executable.
* Documentation is updated in the NVPTXUsage.rst file.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
We may still need to keep CopyToReg even after folding uses into vector
loads, since the original register may be used in other blocks.
Partially reverts 1fdbe6984976d9e85ab3b1a93e8de434a85c5646
This MR adds support for cmpxchg instructions with syncscope (a sketch
follows the list below).
- Adds a new definition for atomic 3-operand instructions, with constant
operands for sem, scope and addsp.
- Lowers cmpxchg SDNodes populating sem, scope and addsp using
SDNodeXForms.
- Handles syncscope correctly for emulation loops in AtomicExpand, in
bracketInstructionWithFences.
- Modifies emitLeadingFence, emitTrailingFence to accept SyncScope as a
parameter. Modifies implementation of these in other backends, with the
parameter being ignored.
- Tests for a _slice_ of all possible combinations of the cmpxchg
instruction (with modifications to cmpxchg.py)
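For example, a scoped cmpxchg now looks like this (a sketch; the scope
string "block" mapping to the PTX .cta scope is an assumption about the
naming this change uses):

```llvm
define i32 @cas_cta(ptr %p, i32 %old, i32 %new) {
  ; syncscope("block") is assumed here to map to the PTX .cta scope.
  %pair = cmpxchg ptr %p, i32 %old, i32 %new syncscope("block") acquire monotonic
  %v = extractvalue { i32, i1 } %pair, 0
  ret i32 %v
}
```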
---------
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Replace uses of BFE with PRMT when lowering v4i8 vectors. This will
generally lead to equivalent or better SASS and reduce the number of
target-specific operations we need to represent
(https://cuda.godbolt.org/z/M75W6f8xd). Also implement KnownBits tracking
for PRMT, allowing elimination of redundant AND instructions when
lowering various i8 operations.
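As an illustration (a sketch; the selector constant in the comment is an
assumption about prmt's byte-select encoding). Because each result byte
of prmt comes from a known source byte, KnownBits can often prove the
upper bits and drop the masking ANDs:

```llvm
define i8 @extract_byte(<4 x i8> %v) {
  ; A single prmt can pick byte 2 out of the packed 32-bit value,
  ; where BFE or shift+and sequences were used before.
  %e = extractelement <4 x i8> %v, i32 2
  ret i8 %e
}
; expected PTX (sketch): prmt.b32 %r2, %r1, 0, 0x2;
```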
Lower `fadd`, `fsub`, `fmul`, and `fma` to f32x2 variants introduced in
PTX 8.6 for sm_100+. Adds a new register class for v2f32 as a b64
register in PTX. This causes other vector operations like loads and
stores to lower as .b64 instead of .v2.b32 as appropriate.
Also update test cases to use the autogenerator.
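A sketch of the new lowering (the exact f32x2 mnemonic is inferred from
the description above):

```llvm
define <2 x float> @vadd(<2 x float> %a, <2 x float> %b) {
  ; On sm_100+/PTX 8.6 the whole v2f32 stays in one b64 register.
  %r = fadd <2 x float> %a, %b
  ret <2 x float> %r
}
; expected PTX (sketch): add.rn.f32x2 %rd3, %rd1, %rd2;
```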
This patch moves the lowering of the TMA Tensor prefetch
and S2G-copy intrinsics to tablegen itself. This is in preparation
for adding Blackwell-specific additions to these intrinsics.
The TMA reduction intrinsics lowering is kept intact (C++), and
hence the macro names are updated to reflect the current usage.
The existing tests have full coverage and continue to pass as expected.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This change cleans up DAG-to-DAG instruction selection around FTZ and
the SETP comparison mode. Largely, these changes do not impact
functionality, though support for `{sin,cos}.approx.ftz.f32` is added.
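For instance, the newly supported form (a sketch; the exact conditions
under which the FTZ variant is selected are not spelled out here):

```llvm
declare float @llvm.sin.f32(float)

define float @fast_sin(float %x) {
  ; afn permits the approximate lowering.
  %r = call afn float @llvm.sin.f32(float %x)
  ret float %r
}
; with FTZ enabled (sketch): sin.approx.ftz.f32 %f2, %f1;
```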
This change continues rewriting and cleanup around DAG ISel for
formal-arguments, return values, and function calls. This causes some
incidental changes, mostly to instruction ordering and register naming
but also a couple of improvements caused by using scalar types earlier in
the lowering.
Remove `UnsafeFPMath` in `visitFMULForFMADistributiveCombine`,
`visitFSUBForFMACombine` and `visitFDIV`.
All affected tests are fixed by adding fast math flags manually.
Propagate fast math flags when lowering fdiv in the NVPTX backend so it
can produce an optimized DAG when `unsafe-fp-math` is absent.
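A sketch of the flag-driven lowering:

```llvm
define float @fast_div(float %a, float %b) {
  ; Instruction-level flags now drive the choice, instead of the
  ; module-wide unsafe-fp-math attribute.
  %r = fdiv afn arcp float %a, %b
  ret float %r
}
; expected PTX (sketch): div.approx.f32 %f3, %f1, %f2;
```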
This change fixes v2i8 lowering for parameters and returned values. As
part of this work, I move the lowering for return values to use generic
ISD::STORE nodes as these are more flexible and have existing
legalization handling.
Note that calling a function with v2i8 arguments or returns is still not
working but this is left for a subsequent change as this MR is already
fairly large.
Partially addresses #128853
Currently, the NVPTXPrologEpilogPass will crash if LIFETIME_START or
LIFETIME_END instructions are encountered. Usually this isn't a problem
since a couple of earlier passes will always remove them. However, when
using opt-bisect-limit, crashes can occur. This can hinder debugging and
reveals a potential future problem if these optimization passes change
their behavior. https://cuda.godbolt.org/z/E81xxKGdb
This change updates NVPTXPrologEpilogPass and
NVPTXRegisterInfo::eliminateFrameIndex to gracefully handle these
instructions by simply removing them. While I'm here I also did some
general fixup in NVPTXPrologEpilogPass to make it look more like
PrologEpilogInserter (from which it was copied).
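A sketch of the kind of IR involved; when the earlier cleanup passes are
bisected away, these markers now simply get dropped instead of crashing
the pass:

```llvm
declare void @llvm.lifetime.start.p0(i64, ptr)
declare void @llvm.lifetime.end.p0(i64, ptr)

define void @f() {
  %a = alloca i32
  call void @llvm.lifetime.start.p0(i64 4, ptr %a)
  store i32 0, ptr %a
  call void @llvm.lifetime.end.p0(i64 4, ptr %a)
  ret void
}
```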
Allow directly storing an immediate instead of requiring that it first
be moved into a register. This makes for more compact and readable PTX.
A similar approach (using a ComplexPattern) could be used for most PTX
instructions to avoid the need for `_[ri]+` variants and boilerplate.
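For example (sketch):

```llvm
define void @store_imm(ptr %p) {
  store i32 42, ptr %p
  ret void
}
; before: mov.b32 %r1, 42;  st.b32 [%rd1], %r1;
; after (sketch): st.b32 [%rd1], 42;
```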
This change consolidates and cleans up various NVPTXISD target-specific
nodes in order to simplify SDAG ISel. While there are some whitespace
changes in the emitted PTX, it is otherwise a non-functional change.
NVPTXISD::Wrapper - This node was used to wrap external-symbol and
global-address nodes. It is redundant and has been removed. Instead we
use the non-target versions of these nodes and convert them
appropriately during ISel.
NVPTXISD::CALL - Much of the family of nodes used to represent a PTX
call instruction has been replaced by this new single node. It
corresponds to a single instruction and is therefore much simpler to
create and lower.
This change adds support for the family-specific architecture variants
introduced in [PTX ISA
8.8](https://docs.nvidia.com/cuda/parallel-thread-execution/#ptx-isa-version-8-8).
These architecture variants have an "f" suffix, for example, sm_100f.
This change doesn't promote existing features to the family-specific
architectures.
## Purpose
This patch is one in a series of code-mods that annotate LLVM’s public
interface for export. This patch annotates the `llvm/Target` library.
These annotations currently have no meaningful impact on the LLVM build;
however, they are a prerequisite to support an LLVM Windows DLL (shared
library) build.
## Background
This effort is tracked in #109483. Additional context is provided in
[this
discourse](https://discourse.llvm.org/t/psa-annotating-llvm-public-interface/85307),
and documentation for `LLVM_ABI` and related annotations is found in the
LLVM repo
[here](https://github.com/llvm/llvm-project/blob/main/llvm/docs/InterfaceExportAnnotations.rst).
A subset of these changes was generated automatically using the
[Interface Definition Scanner (IDS)](https://github.com/compnerd/ids)
tool, followed by formatting with `git clang-format`.
The bulk of this change is manual additions of `LLVM_ABI` to
`LLVMInitializeX` functions defined in .cpp files under llvm/lib/Target.
Adding `LLVM_ABI` to the function implementation is required here
because they do not `#include "llvm/Support/TargetSelect.h"`, which
contains the declarations for these functions and was already updated
with `LLVM_ABI` in a previous patch. I considered patching these files
with `#include "llvm/Support/TargetSelect.h"` instead, but since
TargetSelect.h is a large file with a bunch of preprocessor x-macro
stuff in it I was concerned it would unnecessarily impact compile times.
In addition, a number of unit tests under llvm/unittests/Target required
additional dependencies to make them build correctly against the LLVM
DLL on Windows using MSVC.
## Validation
Local builds and tests were used to validate cross-platform compatibility. This
included llvm, clang, and lldb on the following configurations:
- Windows with MSVC
- Windows with Clang
- Linux with GCC
- Linux with Clang
- Darwin with Clang
In the AArch64 version this helps reduce the number of blr instructions
(indirect jumps) from 325 to 87, and reduces the size of the object
file by 4%. It seems to help make the code more efficient even if it
doesn't greatly affect compile time.
The AMDGPU variants are already marked as final.