llvm-project

Author	SHA1	Message	Date
Stanislav Mekhanoshin	a153e83e41	[AMDGPU] gfx1250 v_wmma_scale[16]_f32_16x16x128_f8f6f4 codegen (#152036 )	2025-08-04 19:16:34 -07:00
Changpeng Fang	d6094370cb	AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2025-07-21 10:09:42 -07:00
Shilei Tian	d7ec80c897	[AMDGPU] Add support for `v_tanh_bf16` on gfx1250 (#147425 ) Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>	2025-07-14 16:30:18 -04:00
Darren Wihandi	9f3931b659	[AMDGPU] Fold fmed3 when inputs include infinity (#144824 )	2025-06-24 21:44:17 +09:00
Harrison Hao	0defde8e06	[AMDGPU] Support D16 folding for image.sample with multiple extractelement and fptrunc users (#141758 ) Currently, D16 folding is only supported for `image.sample` instructions with a single user: an `fptrunc` to half. However, we can actually support D16 folding for image.sample instructions with multiple users, as long as each user follows the pattern of extractelement followed by fptrunc to half. For example: ``` %sample = call <4 x float> @llvm.amdgcn.image.sample %e0 = extractelement <4 x float> %sample, i32 0 %h0 = fptrunc float %e0 to half %e1 = extractelement <4 x float> %sample, i32 1 %h1 = fptrunc float %e1 to half %e2 = extractelement <4 x float> %sample, i32 2 %h2 = fptrunc float %e2 to half ``` This change enables D16 folding for such cases and avoids generating `v_cvt_f16_f32_e32` instructions.	2025-06-18 09:00:07 +08:00
Matt Arsenault	af65cb68f5	AMDGPU: Move fpenvIEEEMode into TTI (#141945 )	2025-06-18 08:13:57 +09:00
Jay Foad	6b25f4439c	[AMDGPU] Detect trivially uniform arguments in InstCombine (#129897 ) Update one test to use an SGPR argument as the simplest way of getting a uniform value.	2025-06-09 12:06:03 +01:00
Ramkumar Ramachandra	b40e4ceaa6	[ValueTracking] Make Depth last default arg (NFC) (#142384 ) Having a finite Depth (or recursion limit) for computeKnownBits is very limiting, but is currently a load-bearing necessity, as all KnownBits are recomputed on each call and there is no caching. As a prerequisite for an effort to remove the recursion limit altogether, either using a clever caching technique, or writing an easily invalidated KnownBits analysis, make the Depth argument in APIs in ValueTracking uniformly the last argument with a default value. This would aid in removing the argument when the time comes, as many callers that currently pass 0 explicitly are now updated to omit the argument altogether.	2025-06-03 17:12:24 +01:00
Matt Arsenault	fabbc40a36	AMDGPU: Make llvm.amdgcn.make.buffer.rsrc propagate poison (#141913 )	2025-05-29 15:38:29 +02:00
Pierre van Houtryve	2278f5e65b	[AMDGPU] Hoist readlane/readfirstlane through unary/binary operands (#129037 ) When a read(first)lane is used on a binary operator and the intrinsic is the only user of the operator, we can move the read(first)lane into the operand if the other operand is uniform. Unfortunately, IC doesn't let us access UniformityAnalysis, so we can't truly check uniformity; we have to make do with a basic uniformity check which only allows constants or trivially uniform intrinsic calls. We can also do the same for unary and cast operators.	2025-05-13 12:00:49 +02:00
Matt Arsenault	038d357dde	AMDGPU: Use minimumnum/maximumnum for fmed3 with amdgpu-ieee=0 (#139546) Try to respect the signaling nan behavior of the instruction, so also start the special case fold for src2.	2025-05-12 20:31:52 +02:00
Matt Arsenault	08dd0406c6	AMDGPU: Use minnum instead of maxnum for fmed3 src2-nan fold (#139531 ) By the pseudocode in the ISA manual, if any input is a nan it acts like min3, which will fold to min2 of the other operands. The other cases fold to min; I'm not sure how this one was wrong.	2025-05-12 20:26:29 +02:00
Matt Arsenault	83107e02ea	AMDGPU: Disable most fmed3 folds for strictfp (#139530 )	2025-05-12 20:21:02 +02:00
Matt Arsenault	bb0a0782ea	AMDGPU: Use less surprising form of ConstantFP::get (#139248 )	2025-05-09 14:55:44 +02:00
Craig Topper	123758b1f4	[IRBuilder] Add versions of createInsertVector/createExtractVector that take a uint64_t index. (#138324 ) Most callers want a constant index. Instead of making every caller create a ConstantInt, we can do it in IRBuilder. This is similar to createInsertElement/createExtractElement.	2025-05-02 16:10:18 -07:00
Jay Foad	886f1199f0	[AMDGPU] Use variadic isa<>. NFC. (#137016 )	2025-04-24 08:19:09 +01:00
Jay Foad	e3350a6263	[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#130133 ) Reland #129895 with a fix to avoid trying to combine bpermute of bitcast.	2025-04-10 10:36:38 +01:00
Juan Manuel Martinez Caamaño	0375ef07c3	[Clang][AMDGPU] Add __builtin_amdgcn_cvt_off_f32_i4 (#133741 ) This built-in maps to `V_CVT_OFF_F32_I4` which treats its input as a 4-bit signed integer and returns `0.0625f * src`. SWDEV-518861	2025-04-02 19:51:40 +02:00
Matt Arsenault	c180fc80dc	AMDGPU: Replace unused permlane inputs with poison instead of undef (#131288 )	2025-03-18 17:37:44 +07:00
Matt Arsenault	052eca9ff7	AMDGPU: Replace unused update.dpp inputs with poison instead of undef (#131287 )	2025-03-18 17:33:58 +07:00
Matt Arsenault	8392573469	AMDGPU: Replace unused export inputs with poison instead of undef (#131286 )	2025-03-18 17:30:42 +07:00
Matt Arsenault	4a3ee4f72d	AMDGPU: Make fma_legacy intrinsic propagate poison (#131063 )	2025-03-14 11:42:47 +07:00
Matt Arsenault	37706894f8	AMDGPU: Make fmul_legacy intrinsic propagate poison (#131062 )	2025-03-14 11:39:47 +07:00
Matt Arsenault	a716459f2d	AMDGPU: Make ballot intrinsic propagate poison (#131061 )	2025-03-14 11:36:44 +07:00
Matt Arsenault	0d8a22d6ad	AMDGPU: Make fmed3 intrinsic propagate poison (#131060 )	2025-03-14 11:30:52 +07:00
Matt Arsenault	9b887f5277	AMDGPU: Make cvt_pknorm and cvt_pk intrinsics propagate poison (#131059 )	2025-03-14 11:27:50 +07:00
Matt Arsenault	0a78bd67b3	AMDGPU: Make frexp_exp and frexp_mant intrinsics propagate poison (#130915 )	2025-03-13 10:07:45 +07:00
Matt Arsenault	d8f17b3de1	AMDGPU: Make sqrt and rsq intrinsics propagate poison (#130914 )	2025-03-13 10:01:48 +07:00
Matt Arsenault	95ab95fd10	AMDGPU: Make rcp intrinsic propagate poison (#130913 )	2025-03-13 09:58:46 +07:00
Matt Arsenault	af755af200	AMDGPU: Handle demanded subvectors for readfirstlane (#128648 )	2025-03-07 17:54:15 +07:00
Jay Foad	78281fd12c	Revert "[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#129895 )" This reverts commit be5149a3158cbce3051629e450950ccb96926365. It caused build failures in the openmp-offload-amdgpu-runtime buildbot and others.	2025-03-06 15:05:19 +00:00
Jay Foad	be5149a315	[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#129895 )	2025-03-06 14:31:59 +00:00
Matt Arsenault	5c375c3283	AMDGPU: Fix worklist management in simplifyDemandedVectorEltsIntrinsic Fixes bot sanitizer error, but it does leave behind a dead instruction if there is a bundle for some reason.	2025-03-05 16:39:19 +07:00
Matt Arsenault	95c64b7ee6	AMDGPU: Reduce readfirstlane for single demanded vector element (#128647 ) If we are only extracting a single element, rewrite the intrinsic call to use the element type. We should extend this to arbitrary extract shuffles.	2025-03-05 08:35:56 +07:00
Matt Arsenault	d410f093da	AMDGPU: Simplify demanded vector elts of readfirstlane sources (#128646 ) Stub implementation of simplifyDemandedVectorEltsIntrinsic for readfirstlane.	2025-02-28 13:01:10 +07:00
Matt Arsenault	447abfcc09	AMDGPU: Fold bitcasts into readfirstlane, readlane, and permlane64 (#128494 ) We should handle this for all the handled readlane and dpp ops.	2025-02-27 20:59:11 +07:00
Matt Arsenault	5deb2aa9eb	AMDGPU: Make is.shared and is.private propagate poison (#128617 )	2025-02-25 12:56:43 +07:00
Fraser Cormack	c82a6a0251	[AMDGPU] Use correct vector elt type when shrinking mfma scale (#123043 ) This might be a copy/paste error. I don't think this is an issue in practice as the builtins/intrinsics are only legal with identical vector element types.	2025-01-15 14:28:42 +00:00
Ramkumar Ramachandra	4a0d53a0b0	PatternMatch: migrate to CmpPredicate (#118534 ) With the introduction of CmpPredicate in 51a895a (IR: introduce struct with CmpInst::Predicate and samesign), PatternMatch is one of the first key pieces of infrastructure that must be updated to match a CmpInst respecting samesign information. Implement this change to Cmp-matchers. This is a preparatory step in migrating the codebase over to CmpPredicate. Since no functional changes are desired at this stage, we have chosen not to migrate CmpPredicate::operator==(CmpPredicate) calls to use CmpPredicate::getMatching(), as that would have visible impact on tests that are not yet written: instead, we call CmpPredicate::operator==(Predicate), preserving the old behavior, while also inserting a few FIXME comments for follow-ups.	2024-12-13 14:18:33 +00:00
Matt Arsenault	c74e2232f2	AMDGPU: Simplify demanded bits on readlane/writelane index arguments (#117963 ) The main goal is to fold away wave64 code when compiled for wave32. If we have out of bounds indexing, these will now clamp down to a low bit which may CSE with the operations on the low half of the wave.	2024-12-06 10:31:14 -05:00
Alex Voicu	48ec59c234	[llvm][AMDGPU] Fold `llvm.amdgcn.wavefrontsize` early (#114481 ) Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that its concrete value is used throughout subsequent optimisation passes.	2024-11-25 10:29:50 +00:00
Matt Arsenault	0a6e8741dd	AMDGPU: Shrink used number of registers for mfma scale based on format (#117047 ) Currently the builtins assume you are using an 8-bit format that requires an 8 element vector. We can shrink the number of registers if the format requires 4 or 6.	2024-11-21 09:08:05 -08:00
Matt Arsenault	01c9a14ccf	AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723 ) These use a new VOP3PX encoding for the v_mfma_scale_* instructions, which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers are supported yet (op_sel, neg or clamp). I'm not sure the intrinsic should really expose op_sel (or any of the others). If I'm reading the documentation correctly, we should be able to just have the raw scale operands and auto-match op_sel to byte extract patterns. The op_sel syntax also seems extra horrible in this usage, especially with the usual assumed op_sel_hi=-1 behavior.	2024-11-21 08:51:58 -08:00
Matt Arsenault	ca1b35a6c8	AMDGPU: Add v_prng_b32 instruction for gfx950 (#116310 ) Rand num instruction for stochastic rounding.	2024-11-18 10:54:54 -08:00
Jay Foad	85c17e4092	[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706 ) Convert many instances of: Fn = Intrinsic::getOrInsertDeclaration(...); CreateCall(Fn, ...) to the equivalent CreateIntrinsic call.	2024-10-17 16:20:43 +01:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation for adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e., just lookup, no creation).	2024-10-11 05:26:03 -07:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Jay Foad	d2d947b7e2	[AMDGPU] Fold llvm.amdgcn.cvt.pkrtz when either operand is fpext (#108237 ) This also generalizes the Undef handling and adds Poison handling.	2024-09-18 09:37:04 +01:00
Jay Foad	ff7eb1d0e9	[AMDGPU] Simplify API of matchFPExtFromF16. NFC. (#108223 )	2024-09-11 17:03:27 +01:00
Jay Foad	f142f8afe2	[AMDGPU] Improve uniform argument handling in InstCombineIntrinsic (#105812 ) Common up handling of intrinsics that are a no-op on uniform arguments. This catches a couple of new cases: readlane (readlane x, y), z -> readlane x, y (for any z, does not have to equal y). permlane64 (readfirstlane x) -> readfirstlane x (and likewise for any other uniform argument to permlane64).	2024-08-23 14:43:31 +01:00

125 Commits