9 Commits

Alex MacLean
3a84a4e55d
Reland "[NVPTX] Unify and extend barrier{.cta} intrinsic support" (#141143)
Note: This relands #140615, adding a ".count" suffix to the non-".all" variants.

Our current support for barrier intrinsics is confusing and incomplete, with
multiple intrinsics mapping to the same instruction and names that do not
clearly convey their semantics. Further, we lack support for some variants.
This change unifies the IR representation into a single, consistently named
set of intrinsics.

- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.count(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned.count(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync.count(x, y)
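
To make the mapping concrete, here is a minimal IR sketch of what the upgrader is described as doing for two of the rules above; the function @legacy and its %id/%cnt arguments are placeholders, not taken from the patch:

```
declare void @llvm.nvvm.barrier0()
declare void @llvm.nvvm.barrier(i32, i32)

define void @legacy(i32 %id, i32 %cnt) {
entry:
  call void @llvm.nvvm.barrier0()
  call void @llvm.nvvm.barrier(i32 %id, i32 %cnt)
  ret void
}
; becomes

declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)
declare void @llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)

define void @legacy(i32 %id, i32 %cnt) {
entry:
  call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
  call void @llvm.nvvm.barrier.cta.sync.aligned.count(i32 %id, i32 %cnt)
  ret void
}
```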
2025-05-22 19:38:10 -07:00
Alex MacLean
e72d8b2553
Revert "[NVPTX] Unify and extend barrier{.cta} intrinsic support (#140615)"
This reverts commit 735209c0688b10a66c24750422b35d8c2ad01bb5.
2025-05-22 17:28:43 +00:00
Alex MacLean
735209c068
[NVPTX] Unify and extend barrier{.cta} intrinsic support (#140615)
Our current support for barrier intrinsics is confusing and incomplete, with
multiple intrinsics mapping to the same instruction and names that do not
clearly convey their semantics. Further, we lack support for some variants.
This change unifies the IR representation into a single, consistently named
set of intrinsics.

- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync(i32, i32)
- llvm.nvvm.barrier.cta.arrive(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync(x, y)
2025-05-21 08:14:15 -07:00
Matt Arsenault
9bdd9dc895
AMDGPU: Mark workitem ID intrinsics with range attribute (#136196)
This avoids the need to have special handling at every use site.
Unfortunately this means we unnecessarily emit AssertZext in the DAG
(where we already directly understand the range of the intrinsic), and
we regress in undefined cases as we don't fold out asserts on undef.
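
A minimal sketch of what the attribute looks like at a call site; the upper bound of 1024 here is an assumed maximum workgroup size, not a value stated in the patch:

```
define i32 @example_tid() {
entry:
  ; the known range is now carried by the call itself rather than
  ; re-derived at every use site
  %id = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x()
  ret i32 %id
}
```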
2025-04-18 12:27:38 +02:00
Jon Chesterfield
cba9dc6e9d
[libc][nfc] Use common implementation of read_first_lane_u64 (#131027)
No codegen regression on either target. The two __builtin_ffs calls implied by
the nvptx implementation CSE away.

```
define internal i64 @__gpu_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) #2 {
entry:
  %shr = lshr i64 %__x, 32
  %conv = trunc nuw i64 %shr to i32
  %conv1 = trunc i64 %__x to i32
  %conv2 = trunc i64 %__lane_mask to i32
  %0 = tail call range(i32 0, 33) i32 @llvm.cttz.i32(i32 %conv2, i1 true)
  %iszero = icmp eq i32 %conv2, 0
  %sub = select i1 %iszero, i32 -1, i32 %0
  %1 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv2, i32 %conv, i32 %sub, i32 31)
  %conv4 = sext i32 %1 to i64
  %shl = shl nsw i64 %conv4, 32
  %2 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv2, i32 %conv1, i32 %sub, i32 31)
  %conv7 = zext i32 %2 to i64
  %or = or disjoint i64 %shl, %conv7
  ret i64 %or
}
; becomes

define internal i64 @__gpu_competing_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) #2 {
entry:
  %shr = lshr i64 %__x, 32
  %conv = trunc nuw i64 %shr to i32
  %conv1 = trunc i64 %__x to i32
  %conv.i = trunc i64 %__lane_mask to i32
  %0 = tail call range(i32 0, 33) i32 @llvm.cttz.i32(i32 %conv.i, i1 true)
  %iszero = icmp eq i32 %conv.i, 0
  %sub.i = select i1 %iszero, i32 -1, i32 %0
  %1 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv.i, i32 %conv, i32 %sub.i, i32 31)
  %conv4 = zext i32 %1 to i64
  %shl = shl nuw i64 %conv4, 32
  %2 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv.i, i32 %conv1, i32 %sub.i, i32 31)
  %conv7 = zext i32 %2 to i64
  %or = or disjoint i64 %shl, %conv7
  ret i64 %or
}
```

The sext vs zext difference is vaguely interesting, but since the bits are
immediately discarded in either case it makes no difference. The amdgcn one
doesn't need CSE; its readfirstlane function is a single call to an
intrinsic.
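
For comparison, a rough sketch of what that amdgcn lowering might look like; the use of the type-overloaded llvm.amdgcn.readfirstlane intrinsic is an assumption, not copied from the test output:

```
define internal i64 @__gpu_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) {
entry:
  ; the lane mask is unused here: readfirstlane already reads from the
  ; first active lane of the wave
  %0 = tail call i64 @llvm.amdgcn.readfirstlane.i64(i64 %__x)
  ret i64 %0
}
```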

Drive-by fix to __gpu_match_all_u32: it was calling first_lane_u64 where
first_lane_u32 suffices. Also added the missing call to the gpuintrin.c test
case and fixed a stray missing static.
2025-03-12 21:29:46 +00:00
Jon Chesterfield
ab557afa40
[libc][nfc] Include instantiations of gpuintrin.h in IR test case (#130956)
Regenerated existing test case with include-generated-funcs to show the
lowered IR for each instantiation.
2025-03-12 13:33:27 +00:00
Joseph Huber
718cdeb9c7
[Clang] Fix test after new argument was added
2025-02-05 12:55:28 -06:00
Alex MacLean
4583f6d344
[NVPTX] Switch front-ends and tests to ptx_kernel cc (#120806)
The `ptx_kernel` calling convention is a more idiomatic and standard way of
specifying an NVPTX kernel than using metadata, which is not supposed to
change the meaning of the program. Further, checking the calling convention
is significantly faster than traversing the metadata, improving compile time.

This change updates the clang and mlir frontends, as well as the
NVPTXCtorDtorLowering pass, to emit kernels using the calling convention. In
addition, it updates all NVPTX unit tests to do the same.
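
A minimal IR sketch of the two forms, using a trivial kernel; the function names are placeholders:

```
; old style: an ordinary function marked as a kernel via module metadata
define void @kernel_md(ptr %out) {
entry:
  ret void
}
!nvvm.annotations = !{!0}
!0 = !{ptr @kernel_md, !"kernel", i32 1}
; becomes

; new style: the calling convention itself identifies the kernel
define ptx_kernel void @kernel_cc(ptr %out) {
entry:
  ret void
}
```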
2025-01-07 18:24:50 -08:00
Joseph Huber
11cc826c0a
[Clang] Implement resource directory headers for common GPU intrinsics (#110179)
Summary:
All GPU-based languages provide some way to access things like the thread ID
or other resources. However, this is spread across many different languages
and varies between targets. The goal here is to provide a resource directory
header that exposes these in an easier-to-understand way, primarily so they
can be used from C/C++ code. The interface aims to be common, to facilitate
easier porting, but target-specific functionality can go in the individual
headers.
2024-11-11 10:09:55 -08:00