132 Commits

Author SHA1 Message Date
Shilei Tian
70905e0afa
[RFC][IR] Remove Constant::isZeroValue (#181521)
`Constant::isZeroValue` currently behaves the same as
`Constant::isNullValue` for all types except floating-point, where it
additionally returns true for negative zero (`-0.0`). However, in
practice, almost all callers operate on integer/pointer types where the
two are equivalent, and the few FP-relevant callers have no meaningful
dependence on the `-0.0` behavior.

This PR removes `isZeroValue` to eliminate the confusing API. All
callers are changed to `isNullValue` with no test failures.

`isZeroValue` will be reintroduced in a future change with clearer
semantics: when null pointers may have non-zero bit patterns,
`isZeroValue` will check for bitwise-all-zeros, while `isNullValue` will
check for the semantic null (which may be non-zero).
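
To make the floating-point difference described above concrete, here is a minimal standalone sketch. It does not use the LLVM API: `isNullLike` and `isZeroLike` are hypothetical stand-ins that model the two predicates on a 32-bit `float` only, assuming IEEE-754 representation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw IEEE-754 bit pattern of a 32-bit float.
static std::uint32_t bitsOf(float f) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t u = 0;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Hypothetical model of isNullValue on FP: true only for +0.0 (all bits zero).
bool isNullLike(float f) { return bitsOf(f) == 0u; }

// Hypothetical model of the removed isZeroValue on FP: true for +0.0 and -0.0.
bool isZeroLike(float f) { return (bitsOf(f) & 0x7FFFFFFFu) == 0u; }

// Usage: the two predicates disagree only on negative zero.
void checkZeroPredicates() {
  assert(isNullLike(0.0f) && isZeroLike(0.0f));
  assert(!isNullLike(-0.0f) && isZeroLike(-0.0f));
  assert(!isNullLike(1.0f) && !isZeroLike(1.0f));
}
```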
2026-02-15 12:06:42 -05:00
Domenic Nutile
5c72240617
[AMDGPU] Add DPP16 Row Share optimization for llvm.amdgcn.wave.shuffle (#177470)
Adds logic to detect cases where the llvm.amdgcn.wave.shuffle intrinsic
is being applied to an index operand that would make the result
equivalent to the various Row Share flavors of DPP16 operations, and
replaces the intrinsic and the instructions computing the index with an
equivalent llvm.amdgcn.update.dpp call.
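
The equivalence this relies on can be sketched at the lane level. The model below is an illustration only, not the pass's detection logic: it assumes a wave32, 16-lane rows, and that row share `n` makes every lane read lane `n` of its own row. All names (`shuffle`, `rowShare`, `rowShareIndices`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWaveSize = 32; // assumption: wave32
constexpr std::size_t kRowSize = 16;  // a DPP16 row is 16 lanes

using Wave = std::array<int, kWaveSize>;
using Indices = std::array<std::size_t, kWaveSize>;

// Lane-level model of a shuffle: lane i reads the value held by lane idx[i].
Wave shuffle(const Wave &src, const Indices &idx) {
  Wave out{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    assert(idx[lane] < kWaveSize);
    out[lane] = src[idx[lane]];
  }
  return out;
}

// Lane-level model of row share: every lane reads lane n of its own row.
Wave rowShare(const Wave &src, std::size_t n) {
  assert(n < kRowSize);
  Wave out{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane)
    out[lane] = src[(lane / kRowSize) * kRowSize + n];
  return out;
}

// Index pattern under which shuffle() matches rowShare(n):
// keep the row base of the current lane, replace the low 4 bits with n.
Indices rowShareIndices(std::size_t n) {
  assert(n < kRowSize);
  Indices idx{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane)
    idx[lane] = (lane & ~(kRowSize - 1)) | n;
  return idx;
}
```

In this model, `shuffle(src, rowShareIndices(n))` and `rowShare(src, n)` produce the same wave for every `n` below 16.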
2026-02-06 15:31:34 -05:00
Matt Arsenault
8a83911c40
AMDGPU: Fix incorrect fold of undef for llvm.amdgcn.trig.preop (#179025)
We were folding undef inputs to qnan which is incorrect. The instruction
never returns nan. Out of bounds segment select will return 0, so fold
undef segment to 0.
2026-02-04 17:56:55 +01:00
Matt Arsenault
b83160b944
AMDGPU: Use extractBitsAsZExtValue to get exponent in trig_preop folding (#179024) 2026-02-02 08:53:01 +01:00
Krzysztof Drewniak
e7dd7b81ac
[AMDGPU] tensor_{load_to/store_from}_lds => ..._d2 simplification (#171540)
This commit adds the rewrite

```
llvm.amdgcn.tensor.{load.to/store.from}.lds(
  <4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer,
  <4 x i32> zeroinitializer, i32 [cachepolicy])
=>
llvm.amdgcn.tensor.{load.to/store.from}.lds.d2(
  <4 x i32> %d0, <8 x i32> %d1, i32 [cachepolicy])
```

This is justified because, when the short encoding that uses the NULL
SGPR for registers 2 and 3 is used, the hardware acts as if those
registers were 0, including in the gather mode.

It is always safe not to run this transformation.

(Note: tests were LLM'd and then tweaked.)
2025-12-15 08:11:03 -08:00
Nikita Popov
123d4d9b85 [AMDGPUInstCombine] Use getSigned() for frexp exponent
It may be negative.
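
A small standalone sketch of why signedness matters here. It uses `std::frexp` as a stand-in for the exponent computation (an assumption; the fold itself works on LLVM constants), and the two `widen*` helpers are hypothetical names that contrast zero extension with sign extension of a 32-bit exponent.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Exponent e such that x == m * 2^e with 0.5 <= |m| < 1 (std::frexp convention).
int frexpExponent(double x) {
  int e = 0;
  std::frexp(x, &e);
  return e;
}

// Widening a 32-bit exponent through an unsigned type zero-extends it.
std::int64_t widenZeroExtended(std::int32_t e) {
  return static_cast<std::int64_t>(static_cast<std::uint32_t>(e));
}

// Widening it as a signed value sign-extends and preserves negatives.
std::int64_t widenSignExtended(std::int32_t e) {
  return static_cast<std::int64_t>(e);
}

// Usage: values below 0.5 in magnitude have a negative exponent.
void checkFrexpExponent() {
  const int e = frexpExponent(0.25); // 0.25 == 0.5 * 2^-1
  assert(e == -1);
  assert(widenSignExtended(e) == -1);
  assert(widenZeroExtended(e) == 4294967295LL);
}
```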
2025-12-12 11:15:45 +01:00
Jay Foad
979462c876
[AMDGPU] Eliminate InstCombineTables.td. NFC. (#170857)
This also eliminates the generated file InstCombineTables.inc which was
99% identical to the existing AMDGPUGenSearchableTables.inc.
2025-12-06 07:27:45 +00:00
Stanislav Mekhanoshin
a153e83e41
[AMDGPU] gfx1250 v_wmma_scale[16]_f32_16x16x128_f8f6f4 codegen (#152036) 2025-08-04 19:16:34 -07:00
Changpeng Fang
d6094370cb
AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-21 10:09:42 -07:00
Shilei Tian
d7ec80c897
[AMDGPU] Add support for v_tanh_bf16 on gfx1250 (#147425)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-14 16:30:18 -04:00
Darren Wihandi
9f3931b659
[AMDGPU] Fold fmed3 when inputs include infinity (#144824) 2025-06-24 21:44:17 +09:00
Harrison Hao
0defde8e06
[AMDGPU] Support D16 folding for image.sample with multiple extractelement and fptrunc users (#141758)
Currently we only support D16 folding for `image.sample` instructions with a
single user: an `fptrunc` to half.
However, we can actually support D16 folding for `image.sample`
instructions with multiple users,
as long as each user follows the pattern of `extractelement` followed by
`fptrunc` to half.
For example:
```
  %sample = call <4 x float> @llvm.amdgcn.image.sample
  %e0 = extractelement <4 x float> %sample, i32 0
  %h0 = fptrunc float %e0 to half
  %e1 = extractelement <4 x float> %sample, i32 1
  %h1 = fptrunc float %e1 to half
  %e2 = extractelement <4 x float> %sample, i32 2
  %h2 = fptrunc float %e2 to half
```
This change enables D16 folding for such cases and avoids generating
`v_cvt_f16_f32_e32` instructions.
2025-06-18 09:00:07 +08:00
Matt Arsenault
af65cb68f5
AMDGPU: Move fpenvIEEEMode into TTI (#141945) 2025-06-18 08:13:57 +09:00
Jay Foad
6b25f4439c
[AMDGPU] Detect trivially uniform arguments in InstCombine (#129897)
Update one test to use an SGPR argument as the simplest way of getting a
uniform value.
2025-06-09 12:06:03 +01:00
Ramkumar Ramachandra
b40e4ceaa6
[ValueTracking] Make Depth last default arg (NFC) (#142384)
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing an easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
2025-06-03 17:12:24 +01:00
Matt Arsenault
fabbc40a36
AMDGPU: Make llvm.amdgcn.make.buffer.rsrc propagate poison (#141913) 2025-05-29 15:38:29 +02:00
Pierre van Houtryve
2278f5e65b
[AMDGPU] Hoist readlane/readfirstlane through unary/binary operands (#129037)
When a read(first)lane is used on a binary operator and the intrinsic is
the only user of the operator, we can move the read(first)lane into the
operand if the other operand is uniform.

Unfortunately, IC doesn't let us access UniformityAnalysis, so we can't
truly check uniformity; we have to make do with a basic uniformity
check that only allows constants or trivially uniform intrinsic calls.

We can also do the same for unary and cast operators.
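
The binary-operator case can be illustrated with a lane-level model. This is a sketch under stated assumptions, not the InstCombine code: it assumes a wave32 with all lanes active (so readfirstlane returns lane 0) and uses integer addition as the operator. `readFirstLane` and `addUniform` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLaneCount = 32; // assumption: wave32, all lanes active
using LaneValues = std::array<int, kLaneCount>;

// Model of readfirstlane when every lane is active: lane 0's value.
int readFirstLane(const LaneValues &v) { return v[0]; }

// Lane-wise add of a per-lane value and a uniform (same in all lanes) value.
LaneValues addUniform(const LaneValues &v, int uniform) {
  LaneValues out{};
  for (std::size_t lane = 0; lane < kLaneCount; ++lane)
    out[lane] = v[lane] + uniform;
  return out;
}

// Usage: reading the lane after the add equals adding after reading the lane.
void checkHoist() {
  LaneValues v{};
  for (std::size_t lane = 0; lane < kLaneCount; ++lane)
    v[lane] = static_cast<int>(lane) * 3 + 7;
  const int c = 5;
  // Before: readfirstlane(add(v, c)). After: add(readfirstlane(v), c).
  assert(readFirstLane(addUniform(v, c)) == readFirstLane(v) + c);
}
```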
2025-05-13 12:00:49 +02:00
Matt Arsenault
038d357dde
AMDGPU: Use minimumnum/maximumnum for fmed3 with amdgpu-ieee=0 (#139546)

Try to respect the signaling nan behavior of the instruction,
so also start the special case fold for src2.
2025-05-12 20:31:52 +02:00
Matt Arsenault
08dd0406c6
AMDGPU: Use minnum instead of maxnum for fmed3 src2-nan fold (#139531)
By the pseudocode in the ISA manual, if any input is a nan it acts
like min3, which will fold to min2 of the other operands. The other
cases fold to min; I'm not sure how this one was wrong.
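
A standalone sketch of the behavior described above, written from the commit text rather than copied from the ISA manual. `fmed3Model` is a hypothetical name; `std::fmin`/`std::fmax` stand in for the hardware min/max, and signaling-NaN and IEEE-mode details are deliberately ignored.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

float min3(float a, float b, float c) { return std::fmin(std::fmin(a, b), c); }
float max3(float a, float b, float c) { return std::fmax(std::fmax(a, b), c); }

// Model of fmed3: with any NaN input it acts like min3, otherwise it returns
// the median by taking the max of the two operands that are not the maximum.
float fmed3Model(float s0, float s1, float s2) {
  if (std::isnan(s0) || std::isnan(s1) || std::isnan(s2))
    return min3(s0, s1, s2);
  const float m = max3(s0, s1, s2);
  if (m == s0)
    return std::fmax(s1, s2);
  if (m == s1)
    return std::fmax(s0, s2);
  return std::fmax(s0, s1);
}

// Usage: a NaN in src2 leaves min2 of the other operands in this model.
void checkFmed3() {
  const float nan = std::numeric_limits<float>::quiet_NaN();
  assert(fmed3Model(3.0f, 1.0f, nan) == std::fmin(3.0f, 1.0f));
  assert(fmed3Model(1.0f, 2.0f, 3.0f) == 2.0f);
}
```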
2025-05-12 20:26:29 +02:00
Matt Arsenault
83107e02ea
AMDGPU: Disable most fmed3 folds for strictfp (#139530) 2025-05-12 20:21:02 +02:00
Matt Arsenault
bb0a0782ea
AMDGPU: Use less surprising form of ConstantFP::get (#139248) 2025-05-09 14:55:44 +02:00
Craig Topper
123758b1f4
[IRBuilder] Add versions of createInsertVector/createExtractVector that take a uint64_t index. (#138324)
Most callers want a constant index. Instead of making every caller
create a ConstantInt, we can do it in IRBuilder. This is similar to
createInsertElement/createExtractElement.
2025-05-02 16:10:18 -07:00
Jay Foad
886f1199f0
[AMDGPU] Use variadic isa<>. NFC. (#137016) 2025-04-24 08:19:09 +01:00
Jay Foad
e3350a6263
[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#130133)
Reland #129895 with a fix to avoid trying to combine bpermute of
bitcast.
2025-04-10 10:36:38 +01:00
Juan Manuel Martinez Caamaño
0375ef07c3
[Clang][AMDGPU] Add __builtin_amdgcn_cvt_off_f32_i4 (#133741)
This built-in maps to `V_CVT_OFF_F32_I4` which treats its input as
a 4-bit signed integer and returns `0.0625f * src`.
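
As a scalar illustration of that description, the sketch below models the conversion in plain C++. `cvtOffF32I4` is a hypothetical name, and the assumption that only the low 4 bits of the source are read is mine, not stated above.

```cpp
#include <cassert>
#include <cstdint>

// Model: take the low 4 bits (assumption), sign-extend them to the range
// [-8, 7], and scale by 0.0625f (1/16).
float cvtOffF32I4(std::uint32_t src) {
  const int low = static_cast<int>(src & 0xFu);
  const int value = low >= 8 ? low - 16 : low; // two's-complement sign extension
  return 0.0625f * static_cast<float>(value);
}

// Usage: the results cover -0.5 to 0.4375 in steps of 1/16.
void checkCvtOff() {
  assert(cvtOffF32I4(0x7u) == 0.4375f);
  assert(cvtOffF32I4(0x8u) == -0.5f);
  assert(cvtOffF32I4(0xFu) == -0.0625f);
}
```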

SWDEV-518861
2025-04-02 19:51:40 +02:00
Matt Arsenault
c180fc80dc
AMDGPU: Replace unused permlane inputs with poison instead of undef (#131288) 2025-03-18 17:37:44 +07:00
Matt Arsenault
052eca9ff7
AMDGPU: Replace unused update.dpp inputs with poison instead of undef (#131287) 2025-03-18 17:33:58 +07:00
Matt Arsenault
8392573469
AMDGPU: Replace unused export inputs with poison instead of undef (#131286) 2025-03-18 17:30:42 +07:00
Matt Arsenault
4a3ee4f72d
AMDGPU: Make fma_legacy intrinsic propagate poison (#131063) 2025-03-14 11:42:47 +07:00
Matt Arsenault
37706894f8
AMDGPU: Make fmul_legacy intrinsic propagate poison (#131062) 2025-03-14 11:39:47 +07:00
Matt Arsenault
a716459f2d
AMDGPU: Make ballot intrinsic propagate poison (#131061) 2025-03-14 11:36:44 +07:00
Matt Arsenault
0d8a22d6ad
AMDGPU: Make fmed3 intrinsic propagate poison (#131060) 2025-03-14 11:30:52 +07:00
Matt Arsenault
9b887f5277
AMDGPU: Make cvt_pknorm and cvt_pk intrinsics propagate poison (#131059) 2025-03-14 11:27:50 +07:00
Matt Arsenault
0a78bd67b3
AMDGPU: Make frexp_exp and frexp_mant intrinsics propagate poison (#130915) 2025-03-13 10:07:45 +07:00
Matt Arsenault
d8f17b3de1
AMDGPU: Make sqrt and rsq intrinsics propagate poison (#130914) 2025-03-13 10:01:48 +07:00
Matt Arsenault
95ab95fd10
AMDGPU: Make rcp intrinsic propagate poison (#130913) 2025-03-13 09:58:46 +07:00
Matt Arsenault
af755af200
AMDGPU: Handle demanded subvectors for readfirstlane (#128648) 2025-03-07 17:54:15 +07:00
Jay Foad
78281fd12c Revert "[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#129895)"
This reverts commit be5149a3158cbce3051629e450950ccb96926365.

It caused build failures in the openmp-offload-amdgpu-runtime buildbot
and others.
2025-03-06 15:05:19 +00:00
Jay Foad
be5149a315
[AMDGPU] InstCombine llvm.amdgcn.ds.bpermute with uniform arguments (#129895) 2025-03-06 14:31:59 +00:00
Matt Arsenault
5c375c3283 AMDGPU: Fix worklist management in simplifyDemandedVectorEltsIntrinsic
Fixes bot sanitizer error, but it does leave behind a dead instruction
if there is a bundle for some reason.
2025-03-05 16:39:19 +07:00
Matt Arsenault
95c64b7ee6
AMDGPU: Reduce readfirstlane for single demanded vector element (#128647)
If we are only extracting a single element, rewrite the intrinsic call
to use the element type. We should extend this to arbitrary extract
shuffles.
2025-03-05 08:35:56 +07:00
Matt Arsenault
d410f093da
AMDGPU: Simplify demanded vector elts of readfirstlane sources (#128646)
Stub implementation of simplifyDemandedVectorEltsIntrinsic for
readfirstlane.
2025-02-28 13:01:10 +07:00
Matt Arsenault
447abfcc09
AMDGPU: Fold bitcasts into readfirstlane, readlane, and permlane64 (#128494)
We should handle this for all the handled readlane and dpp ops.
2025-02-27 20:59:11 +07:00
Matt Arsenault
5deb2aa9eb
AMDGPU: Make is.shared and is.private propagate poison (#128617) 2025-02-25 12:56:43 +07:00
Fraser Cormack
c82a6a0251
[AMDGPU] Use correct vector elt type when shrinking mfma scale (#123043)
This might be a copy/paste error. I don't think this is an issue in
practice as the builtins/intrinsics are only legal with identical vector
element types.
2025-01-15 14:28:42 +00:00
Ramkumar Ramachandra
4a0d53a0b0
PatternMatch: migrate to CmpPredicate (#118534)
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.

This is a preparatory step in migrating the codebase over to
CmpPredicate. Since no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
2024-12-13 14:18:33 +00:00
Matt Arsenault
c74e2232f2
AMDGPU: Simplify demanded bits on readlane/writelane index arguments (#117963)
The main goal is to fold away wave64 code when compiled for wave32.
If we have out of bounds indexing, these will now clamp down to
a low bit which may CSE with the operations on the low half of the
wave.
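
A rough sketch of the idea, assuming that only the low log2(wave size) bits of the index select a lane, which is how the description above reads. `effectiveLane` is a hypothetical helper, not an LLVM or hardware API.

```cpp
#include <cassert>
#include <cstdint>

// Lane actually selected by an index when only the low log2(waveSize)
// bits are demanded (assumption based on the commit description).
std::uint32_t effectiveLane(std::uint32_t index, std::uint32_t waveSize) {
  assert(waveSize == 32 || waveSize == 64);
  return index & (waveSize - 1);
}
```

Under this model, an index of 40 written for wave64 selects lane 8 on wave32, the same lane as an index of 8, which is what allows the two operations to CSE.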
2024-12-06 10:31:14 -05:00
Alex Voicu
48ec59c234
[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early (#114481)
Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that its
concrete value is used throughout subsequent optimisation passes.
2024-11-25 10:29:50 +00:00
Matt Arsenault
0a6e8741dd
AMDGPU: Shrink used number of registers for mfma scale based on format (#117047)
Currently the builtins assume you are using an 8-bit format that requires
an 8 element vector. We can shrink the number of registers if the format
requires 4 or 6.
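
The register arithmetic can be sketched as follows. The `ScaleFormat` tags and helper names are hypothetical (the real intrinsic encodes the format as an immediate operand), and the figure of 32 elements per lane is an assumption derived from an 8-bit format filling eight 32-bit registers.

```cpp
#include <cassert>

// Hypothetical format tags for illustration only.
enum class ScaleFormat { FP8, BF8, FP6, BF6, FP4 };

constexpr unsigned bitsPerElement(ScaleFormat f) {
  switch (f) {
  case ScaleFormat::FP8:
  case ScaleFormat::BF8:
    return 8;
  case ScaleFormat::FP6:
  case ScaleFormat::BF6:
    return 6;
  case ScaleFormat::FP4:
    return 4;
  }
  return 0; // unreachable for valid tags
}

// Number of 32-bit registers needed to hold elementsPerLane packed elements.
unsigned registersNeeded(ScaleFormat f, unsigned elementsPerLane = 32) {
  assert(elementsPerLane > 0);
  return (elementsPerLane * bitsPerElement(f) + 31) / 32;
}
```

With these assumptions, the 8-bit formats need 8 registers, the 6-bit formats 6, and the 4-bit format 4, matching the counts in the message.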
2024-11-21 09:08:05 -08:00
Matt Arsenault
01c9a14ccf
AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723)
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).

I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.

The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
2024-11-21 08:51:58 -08:00