llvm-project

Author	SHA1	Message	Date
Shilei Tian	70905e0afa	[RFC][IR] Remove `Constant::isZeroValue` (#181521 ) `Constant::isZeroValue` currently behaves the same as `Constant::isNullValue` for all types except floating-point, where it additionally returns true for negative zero (`-0.0`). However, in practice, almost all callers operate on integer/pointer types where the two are equivalent, and the few FP-relevant callers have no meaningful dependence on the `-0.0` behavior. This PR removes `isZeroValue` to eliminate the confusing API. All callers are changed to `isNullValue` with no test failures. `isZeroValue` will be reintroduced in a future change with clearer semantics: when null pointers may have non-zero bit patterns, `isZeroValue` will check for bitwise-all-zeros, while `isNullValue` will check for the semantic null (which may be non-zero).	2026-02-15 12:06:42 -05:00
Domenic Nutile	5c72240617	[AMDGPU] Add DPP16 Row Share optimization for llvm.amdgcn.wave.shuffle (#177470 ) Adds logic to detect cases where the llvm.amdgcn.wave.shuffle intrinsic is being applied to an index operand that would make the result equivalent to the various Row Share flavors of DPP16 operations, and replaces the intrinsic and the instructions computing the index with an equivalent llvm.amdgcn.update.dpp call.	2026-02-06 15:31:34 -05:00
Matt Arsenault	8a83911c40	AMDGPU: Fix incorrect fold of undef for llvm.amdgcn.trig.preop (#179025 ) We were folding undef inputs to qnan which is incorrect. The instruction never returns nan. Out of bounds segment select will return 0, so fold undef segment to 0.	2026-02-04 17:56:55 +01:00
Matt Arsenault	b83160b944	AMDGPU: Use extractBitsAsZExtValue to get exponent in trig_preop folding (#179024 )	2026-02-02 08:53:01 +01:00
Krzysztof Drewniak	e7dd7b81ac	[AMDGPU] tensor_{load_to/store_from}_lds => ..._d2 simplification (#171540 ) This commit adds the rewrite ``` llvm.amdgcn.tensor.{load.to/store.from}.lds( <4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, i32 [cachepolicy]) => llvm.amdgcn.tensor.{load.to/store.from}.lds.d2( <4 x i32> %d0, <8 x i32> %d1, i32 [cachepolicy]) ``` This is justified because, when the short encoding that uses the NULL SGPR for registers 2 and 3 is used, the hardware acts as if those registers were 0, including in the gather mode. It is always safe not to run this transformation. (Note: tests were LLM'd and then tweaked.)	2025-12-15 08:11:03 -08:00
Nikita Popov	123d4d9b85	[AMGGPUInstCombine] Use getSigned() for frexp exponent It may be negative.	2025-12-12 11:15:45 +01:00
Jay Foad	979462c876	[AMDGPU] Eliminate InstCombineTables.td. NFC. (#170857 ) This also eliminates the generated file InstCombineTables.inc which was 99% identical to the existing AMDGPUGenSearchableTables.inc.	2025-12-06 07:27:45 +00:00
Stanislav Mekhanoshin	a153e83e41	[AMDGPU] gfx1250 v_wmma_scale[16]_f32_16x16x128_f8f6f4 codegen (#152036 )	2025-08-04 19:16:34 -07:00
Changpeng Fang	d6094370cb	AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2025-07-21 10:09:42 -07:00
Shilei Tian	d7ec80c897	[AMDGPU] Add support for `v_tanh_bf16` on gfx1250 (#147425 ) Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>	2025-07-14 16:30:18 -04:00
Darren Wihandi	9f3931b659	[AMDGPU] Fold fmed3 when inputs include infinity (#144824 )	2025-06-24 21:44:17 +09:00
Harrison Hao	0defde8e06	[AMDGPU] Support D16 folding for image.sample with multiple extractelement and fptrunc users (#141758 ) Currently, we only support D16 folding for `image sample` instructions with a single user: a `fptrunc` to half. However, we can actually support D16 folding for image.sample instructions with multiple users, as long as each user follows the pattern of extractelement followed by fptrunc to half. For example: ``` %sample = call <4 x float> @llvm.amdgcn.image.sample %e0 = extractelement <4 x float> %sample, i32 0 %h0 = fptrunc float %e0 to half %e1 = extractelement <4 x float> %sample, i32 1 %h1 = fptrunc float %e1 to half %e2 = extractelement <4 x float> %sample, i32 2 %h2 = fptrunc float %e2 to half ``` This change enables D16 folding for such cases and avoids generating `v_cvt_f16_f32_e32` instructions.	2025-06-18 09:00:07 +08:00
Matt Arsenault	af65cb68f5	AMDGPU: Move fpenvIEEEMode into TTI (#141945 )	2025-06-18 08:13:57 +09:00
Jay Foad	6b25f4439c	[AMDGPU] Detect trivially uniform arguments in InstCombine (#129897 ) Update one test to use an SGPR argument as the simplest way of getting a uniform value.	2025-06-09 12:06:03 +01:00
Ramkumar Ramachandra	b40e4ceaa6	[ValueTracking] Make Depth last default arg (NFC) (#142384 ) Having a finite Depth (or recursion limit) for computeKnownBits is very limiting, but is currently a load-bearing necessity, as all KnownBits are recomputed on each call and there is no caching. As a prerequisite for an effort to remove the recursion limit altogether, either using a clever caching technique, or writing an easily-invalidable KnownBits analysis, make the Depth argument in APIs in ValueTracking uniformly the last argument with a default value. This would aid in removing the argument when the time comes, as many callers that currently pass 0 explicitly are now updated to omit the argument altogether.	2025-06-03 17:12:24 +01:00
Matt Arsenault	fabbc40a36	AMDGPU: Make llvm.amdgcn.make.buffer.rsrc propagate poison (#141913 )	2025-05-29 15:38:29 +02:00
Pierre van Houtryve	2278f5e65b	[AMDGPU] Hoist readlane/readfirstlane through unary/binary operands (#129037 ) When a read(first)lane is used on a binary operator and the intrinsic is the only user of the operator, we can move the read(first)lane into the operand if the other operand is uniform. Unfortunately IC doesn't let us access UniformityAnalysis and thus we can't truly check uniformity, so we have to make do with a basic uniformity check which only allows constants or trivially uniform intrinsic calls. We can also do the same for unary and cast operators.	2025-05-13 12:00:49 +02:00
Matt Arsenault	038d357dde	AMDGPU: Use minimumnum/maximumnum for fmed3 with amdgpu-ieee=0 (#139546) Try to respect the signaling nan behavior of the instruction, so also start the special case fold for src2.	2025-05-12 20:31:52 +02:00
Matt Arsenault	08dd0406c6	AMDGPU: Use minnum instead of maxnum for fmed3 src2-nan fold (#139531 ) By the pseudocode in the ISA manual, if any input is a nan it acts like min3, which will fold to min2 of the other operands. The other cases fold to min, I'm not sure how this one was wrong.	2025-05-12 20:26:29 +02:00
Matt Arsenault	83107e02ea	AMDGPU: Disable most fmed3 folds for strictfp (#139530 )	2025-05-12 20:21:02 +02:00
Matt Arsenault	bb0a0782ea	AMDGPU: Use less surprising form of ConstantFP::get (#139248 )	2025-05-09 14:55:44 +02:00
Craig Topper	123758b1f4	[IRBuilder] Add versions of createInsertVector/createExtractVector that take a uint64_t index. (#138324 ) Most callers want a constant index. Instead of making every caller create a ConstantInt, we can do it in IRBuilder. This is similar to createInsertElement/createExtractElement.	2025-05-02 16:10:18 -07:00
Jay Foad	886f1199f0	[AMDGPU] Use variadic isa<>. NFC. (#137016 )	2025-04-24 08:19:09 +01:00
Jay Foad	e3350a6263	[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#130133 ) Reland #129895 with a fix to avoid trying to combine bpermute of bitcast.	2025-04-10 10:36:38 +01:00
Juan Manuel Martinez Caamaño	0375ef07c3	[Clang][AMDGPU] Add __builtin_amdgcn_cvt_off_f32_i4 (#133741 ) This built-in maps to `V_CVT_OFF_F32_I4` which treats its input as a 4-bit signed integer and returns `0.0625f * src`. SWDEV-518861	2025-04-02 19:51:40 +02:00
Matt Arsenault	c180fc80dc	AMDGPU: Replace unused permlane inputs with poison instead of undef (#131288 )	2025-03-18 17:37:44 +07:00
Matt Arsenault	052eca9ff7	AMDGPU: Replace unused update.dpp inputs with poison instead of undef (#131287 )	2025-03-18 17:33:58 +07:00
Matt Arsenault	8392573469	AMDGPU: Replace unused export inputs with poison instead of undef (#131286 )	2025-03-18 17:30:42 +07:00
Matt Arsenault	4a3ee4f72d	AMDGPU: Make fma_legacy intrinsic propagate poison (#131063 )	2025-03-14 11:42:47 +07:00
Matt Arsenault	37706894f8	AMDGPU: Make fmul_legacy intrinsic propagate poison (#131062 )	2025-03-14 11:39:47 +07:00
Matt Arsenault	a716459f2d	AMDGPU: Make ballot intrinsic propagate poison (#131061 )	2025-03-14 11:36:44 +07:00
Matt Arsenault	0d8a22d6ad	AMDGPU: Make fmed3 intrinsic propagate poison (#131060 )	2025-03-14 11:30:52 +07:00
Matt Arsenault	9b887f5277	AMDGPU: Make cvt_pknorm and cvt_pk intrinsics propagate poison (#131059 )	2025-03-14 11:27:50 +07:00
Matt Arsenault	0a78bd67b3	AMDGPU: Make frexp_exp and frexp_mant intrinsics propagate poison (#130915 )	2025-03-13 10:07:45 +07:00
Matt Arsenault	d8f17b3de1	AMDGPU: Make sqrt and rsq intrinsics propagate poison (#130914 )	2025-03-13 10:01:48 +07:00
Matt Arsenault	95ab95fd10	AMDGPU: Make rcp intrinsic propagate poison (#130913 )	2025-03-13 09:58:46 +07:00
Matt Arsenault	af755af200	AMDGPU: Handle demanded subvectors for readfirstlane (#128648 )	2025-03-07 17:54:15 +07:00
Jay Foad	78281fd12c	Revert "[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#129895 )" This reverts commit be5149a3158cbce3051629e450950ccb96926365. It caused build failures in the openmp-offload-amdgpu-runtime buildbot and others.	2025-03-06 15:05:19 +00:00
Jay Foad	be5149a315	[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#129895 )	2025-03-06 14:31:59 +00:00
Matt Arsenault	5c375c3283	AMDGPU: Fix worklist management in simplifyDemandedVectorEltsIntrinsic Fixes bot sanitizer error, but it does leave behind a dead instruction if there is a bundle for some reason.	2025-03-05 16:39:19 +07:00
Matt Arsenault	95c64b7ee6	AMDGPU: Reduce readfirstlane for single demanded vector element (#128647 ) If we are only extracting a single element, rewrite the intrinsic call to use the element type. We should extend this to arbitrary extract shuffles.	2025-03-05 08:35:56 +07:00
Matt Arsenault	d410f093da	AMDGPU: Simplify demanded vector elts of readfirstlane sources (#128646 ) Stub implementation of simplifyDemandedVectorEltsIntrinsic for readfirstlane.	2025-02-28 13:01:10 +07:00
Matt Arsenault	447abfcc09	AMDGPU: Fold bitcasts into readfirstlane, readlane, and permlane64 (#128494 ) We should handle this for all the handled readlane and dpp ops.	2025-02-27 20:59:11 +07:00
Matt Arsenault	5deb2aa9eb	AMDGPU: Make is.shared and is.private propagate poison (#128617 )	2025-02-25 12:56:43 +07:00
Fraser Cormack	c82a6a0251	[AMDGPU] Use correct vector elt type when shrinking mfma scale (#123043 ) This might be a copy/paste error. I don't think this is an issue in practice as the builtins/intrinsics are only legal with identical vector element types.	2025-01-15 14:28:42 +00:00
Ramkumar Ramachandra	4a0d53a0b0	PatternMatch: migrate to CmpPredicate (#118534 ) With the introduction of CmpPredicate in 51a895a (IR: introduce struct with CmpInst::Predicate and samesign), PatternMatch is one of the first key pieces of infrastructure that must be updated to match a CmpInst respecting samesign information. Implement this change to Cmp-matchers. This is a preparatory step in migrating the codebase over to CmpPredicate. Since no functional changes are desired at this stage, we have chosen not to migrate CmpPredicate::operator==(CmpPredicate) calls to use CmpPredicate::getMatching(), as that would have visible impact on tests that are not yet written: instead, we call CmpPredicate::operator==(Predicate), preserving the old behavior, while also inserting a few FIXME comments for follow-ups.	2024-12-13 14:18:33 +00:00
Matt Arsenault	c74e2232f2	AMDGPU: Simplify demanded bits on readlane/writelane index arguments (#117963 ) The main goal is to fold away wave64 code when compiled for wave32. If we have out of bounds indexing, these will now clamp down to a low bit which may CSE with the operations on the low half of the wave.	2024-12-06 10:31:14 -05:00
Alex Voicu	48ec59c234	[llvm][AMDGPU] Fold `llvm.amdgcn.wavefrontsize` early (#114481 ) Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that its concrete value is used throughout subsequent optimisation passes.	2024-11-25 10:29:50 +00:00
Matt Arsenault	0a6e8741dd	AMDGPU: Shrink used number of registers for mfma scale based on format (#117047 ) Currently the builtins assume you are using an 8-bit format that requires an 8 element vector. We can shrink the number of registers if the format requires 4 or 6.	2024-11-21 09:08:05 -08:00
Matt Arsenault	01c9a14ccf	AMDGPU: Define v_mfma_f32_{16x16x128\|32x32x64}_f8f6f4 instructions (#116723 ) These use a new VOP3PX encoding for the v_mfma_scale_* instructions, which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers are supported yet (op_sel, neg or clamp). I'm not sure the intrinsic should really expose op_sel (or any of the others). If I'm reading the documentation correctly, we should be able to just have the raw scale operands and auto-match op_sel to byte extract patterns. The op_sel syntax also seems extra horrible in this usage, especially with the usual assumed op_sel_hi=-1 behavior.	2024-11-21 08:51:58 -08:00
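
Commit 70905e0afa above turns on the difference between "all bits are zero" and "compares equal to zero" for floating-point values. A minimal standalone sketch of that distinction follows. The helper names `isAllZeroBits`, `isZeroIgnoringSign`, and `checkZeroKinds` are hypothetical and are not LLVM APIs; the sketch also assumes a 64-bit IEEE-754 `double`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in, not an LLVM API.
// True only when every bit of the double is zero, i.e. for +0.0.
bool isAllZeroBits(double V) {
  std::uint64_t Bits;
  static_assert(sizeof(Bits) == sizeof(V), "sketch assumes a 64-bit double");
  std::memcpy(&Bits, &V, sizeof(Bits));
  return Bits == 0;
}

// Hypothetical stand-in, not an LLVM API.
// True for both +0.0 and -0.0, because IEEE-754 comparison treats them as equal.
bool isZeroIgnoringSign(double V) { return V == 0.0; }

// Usage example: -0.0 is the only value on which the two checks disagree.
void checkZeroKinds() {
  assert(isAllZeroBits(0.0) && isZeroIgnoringSign(0.0));
  assert(!isAllZeroBits(-0.0) && isZeroIgnoringSign(-0.0));
  assert(!isAllZeroBits(1.0) && !isZeroIgnoringSign(1.0));
}
```

The sign bit of `-0.0` is set, so the bitwise check rejects it while the comparison accepts it. That is the floating-point-only divergence the commit message describes.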

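Commit 0375ef07c3 above describes `V_CVT_OFF_F32_I4` as treating its input as a 4-bit signed integer and returning `0.0625f * src`. The arithmetic can be sketched as a small reference model. The function name `cvtOffF32I4` is hypothetical, and the masking of the input to its low 4 bits is an assumption of this sketch, not something the commit message states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference model, not the Clang builtin itself.
float cvtOffF32I4(std::uint32_t Src) {
  // Assumption: only the low 4 bits of the source are significant.
  int Nibble = static_cast<int>(Src & 0xFu);
  // Sign-extend the 4-bit two's-complement value to the range [-8, 7].
  int Signed = (Nibble >= 8) ? Nibble - 16 : Nibble;
  assert(Signed >= -8 && Signed <= 7);
  // Scale by 1/16, as described in the commit message.
  return 0.0625f * static_cast<float>(Signed);
}
```

Under these assumptions the result always lies in [-0.5, 0.4375]: an input of `0x7` gives `0.4375f` and an input of `0x8` gives `-0.5f`.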
132 Commits