9 Commits

Alex MacLean
3a84a4e55d
Reland "[NVPTX] Unify and extend barrier{.cta} intrinsic support" (#141143)
Note: This relands #140615, adding a ".count" suffix to the non-".all" variants.

Our current support for barrier intrinsics is confusing and incomplete, with
multiple intrinsics mapping to the same instruction and names that do not
clearly convey their semantics. Further, we lack support for some variants.
This change unifies the IR representation into a single, consistently named
set of intrinsics.

- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.count(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned.count(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync.count(x, y)
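
To make the mapping concrete, here is a minimal IR sketch of what the upgrader is described as doing for two of the rules above; the function @legacy and its %id/%cnt arguments are placeholders, not taken from the patch:

```
declare void @llvm.nvvm.barrier0()
declare void @llvm.nvvm.barrier(i32, i32)

define void @legacy(i32 %id, i32 %cnt) {
entry:
  call void @llvm.nvvm.barrier0()
  call void @llvm.nvvm.barrier(i32 %id, i32 %cnt)
  ret void
}
; becomes

declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)
declare void @llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)

define void @legacy(i32 %id, i32 %cnt) {
entry:
  call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
  call void @llvm.nvvm.barrier.cta.sync.aligned.count(i32 %id, i32 %cnt)
  ret void
}
```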
2025-05-22 19:38:10 -07:00
Alex MacLean
e72d8b2553
Revert "[NVPTX] Unify and extend barrier{.cta} intrinsic support (#140615)"
This reverts commit 735209c0688b10a66c24750422b35d8c2ad01bb5.
2025-05-22 17:28:43 +00:00
Alex MacLean
735209c068
[NVPTX] Unify and extend barrier{.cta} intrinsic support (#140615)
Our current support for barrier intrinsics is confusing and incomplete, with
multiple intrinsics mapping to the same instruction and names that do not
clearly convey their semantics. Further, we lack support for some variants.
This change unifies the IR representation into a single, consistently named
set of intrinsics.

- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync(i32, i32)
- llvm.nvvm.barrier.cta.arrive(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync(x, y)
2025-05-21 08:14:15 -07:00
Matt Arsenault
9bdd9dc895
AMDGPU: Mark workitem ID intrinsics with range attribute (#136196)
This avoids the need to have special handling at every use site.
Unfortunately this means we unnecessarily emit AssertZext in the DAG
(where we already directly understand the range of the intrinsic), and
we regress in undefined cases as we don't fold out asserts on undef.
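
A minimal sketch of what the attribute looks like at a call site; the upper bound of 1024 here is an assumed maximum workgroup size, not a value stated in the patch:

```
define i32 @example_tid() {
entry:
  ; the known range is now carried by the call itself rather than
  ; re-derived at every use site
  %id = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x()
  ret i32 %id
}
```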
2025-04-18 12:27:38 +02:00
Jon Chesterfield
cba9dc6e9d
[libc][nfc] Use common implementation of read_first_lane_u64 (#131027)
No codegen regression on either target. The two __builtin_ffs calls implied by
the nvptx implementation CSE away.

```
define internal i64 @__gpu_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) #2 {
entry:
  %shr = lshr i64 %__x, 32
  %conv = trunc nuw i64 %shr to i32
  %conv1 = trunc i64 %__x to i32
  %conv2 = trunc i64 %__lane_mask to i32
  %0 = tail call range(i32 0, 33) i32 @llvm.cttz.i32(i32 %conv2, i1 true)
  %iszero = icmp eq i32 %conv2, 0
  %sub = select i1 %iszero, i32 -1, i32 %0
  %1 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv2, i32 %conv, i32 %sub, i32 31)
  %conv4 = sext i32 %1 to i64
  %shl = shl nsw i64 %conv4, 32
  %2 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv2, i32 %conv1, i32 %sub, i32 31)
  %conv7 = zext i32 %2 to i64
  %or = or disjoint i64 %shl, %conv7
  ret i64 %or
}
; becomes

define internal i64 @__gpu_competing_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) #2 {
entry:
  %shr = lshr i64 %__x, 32
  %conv = trunc nuw i64 %shr to i32
  %conv1 = trunc i64 %__x to i32
  %conv.i = trunc i64 %__lane_mask to i32
  %0 = tail call range(i32 0, 33) i32 @llvm.cttz.i32(i32 %conv.i, i1 true)
  %iszero = icmp eq i32 %conv.i, 0
  %sub.i = select i1 %iszero, i32 -1, i32 %0
  %1 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv.i, i32 %conv, i32 %sub.i, i32 31)
  %conv4 = zext i32 %1 to i64
  %shl = shl nuw i64 %conv4, 32
  %2 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv.i, i32 %conv1, i32 %sub.i, i32 31)
  %conv7 = zext i32 %2 to i64
  %or = or disjoint i64 %shl, %conv7
  ret i64 %or
}
```

The sext vs zext difference is vaguely interesting, but since the bits are
immediately discarded in either case it makes no difference. The amdgcn one
doesn't need CSE; its readfirstlane function is a single call to an
intrinsic.
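
For comparison, a rough sketch of what that amdgcn lowering might look like; the use of the type-overloaded llvm.amdgcn.readfirstlane intrinsic is an assumption, not copied from the test output:

```
define internal i64 @__gpu_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) {
entry:
  ; the lane mask is unused here: readfirstlane already reads from the
  ; first active lane of the wave
  %0 = tail call i64 @llvm.amdgcn.readfirstlane.i64(i64 %__x)
  ret i64 %0
}
```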

Drive-by fix to __gpu_match_all_u32: it was calling first_lane_u64 where
first_lane_u32 suffices. Also added the missing call to the gpuintrin.c test
case and fixed a stray missing static.
2025-03-12 21:29:46 +00:00
Jon Chesterfield
ab557afa40
[libc][nfc] Include instantiations of gpuintrin.h in IR test case (#130956)
Regenerated existing test case with include-generated-funcs to show the
lowered IR for each instantiation.
2025-03-12 13:33:27 +00:00
Joseph Huber
718cdeb9c7
[Clang] Fix test after new argument was added
2025-02-05 12:55:28 -06:00
Alex MacLean
4583f6d344
[NVPTX] Switch front-ends and tests to ptx_kernel cc (#120806)
The `ptx_kernel` calling convention is a more idiomatic and standard way of
specifying an NVPTX kernel than using metadata, which is not supposed to
change the meaning of the program. Further, checking the calling convention
is significantly faster than traversing the metadata, improving compile time.

This change updates the clang and mlir frontends, as well as the
NVPTXCtorDtorLowering pass, to emit kernels using the calling convention. In
addition, it updates all NVPTX unit tests to do the same.
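
A minimal IR sketch of the two forms, using a trivial kernel; the function names are placeholders:

```
; old style: an ordinary function marked as a kernel via module metadata
define void @kernel_md(ptr %out) {
entry:
  ret void
}
!nvvm.annotations = !{!0}
!0 = !{ptr @kernel_md, !"kernel", i32 1}
; becomes

; new style: the calling convention itself identifies the kernel
define ptx_kernel void @kernel_cc(ptr %out) {
entry:
  ret void
}
```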
2025-01-07 18:24:50 -08:00
Joseph Huber
11cc826c0a
[Clang] Implement resource directory headers for common GPU intrinsics (#110179)
Summary:
All GPU-based languages provide some way to access things like the thread ID
or other resources. However, this is spread across many different languages
and varies between targets. The goal here is to provide a resource directory
header that exposes these in an easier-to-understand way, primarily so they
can be used from C/C++ code. The interface aims to be common, to facilitate
easier porting, but target-specific functionality can go in the individual
headers.
2024-11-11 10:09:55 -08:00