88 Commits

Author SHA1 Message Date
Fraser Cormack
c82a6a0251
[AMDGPU] Use correct vector elt type when shrinking mfma scale (#123043)
This might be a copy/paste error. I don't think this is an issue in
practice as the builtins/intrinsics are only legal with identical vector
element types.
2025-01-15 14:28:42 +00:00
Ramkumar Ramachandra
4a0d53a0b0
PatternMatch: migrate to CmpPredicate (#118534)
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.

This is a preparatory step in migrating the codebase over to
CmpPredicate. Since no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
2024-12-13 14:18:33 +00:00
Matt Arsenault
c74e2232f2
AMDGPU: Simplify demanded bits on readlane/writelane index arguments (#117963)
The main goal is to fold away wave64 code when compiled for wave32.
If we have out-of-bounds indexing, these will now clamp down to a
low bit, which may CSE with the operations on the low half of the
wave.
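A minimal IR sketch of the idea (the overloaded .i32 spelling of
readlane is assumed):
```
; wave32: only the low 5 bits of the index are demanded, so the
; "| 32" from a wave64-style high-half access folds away
%hi = or i32 %lane, 32
%r  = call i32 @llvm.amdgcn.readlane.i32(i32 %x, i32 %hi)
; -> %r = call i32 @llvm.amdgcn.readlane.i32(i32 %x, i32 %lane)
```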
2024-12-06 10:31:14 -05:00
Alex Voicu
48ec59c234
[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early (#114481)
Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that its
concrete value is used throughout subsequent optimisation passes.
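For example, on a wave32 target the call now folds to a constant
during InstCombine (sketch):
```
%ws = call i32 @llvm.amdgcn.wavefrontsize()
; -> every use of %ws is replaced with the constant i32 32
```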
2024-11-25 10:29:50 +00:00
Matt Arsenault
0a6e8741dd
AMDGPU: Shrink used number of registers for mfma scale based on format (#117047)
Currently the builtins assume you are using an 8-bit format that requires
an 8-element vector. We can shrink the number of registers if the format
requires 4 or 6.
2024-11-21 09:08:05 -08:00
Matt Arsenault
01c9a14ccf
AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723)
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).

I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.

The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
2024-11-21 08:51:58 -08:00
Matt Arsenault
ca1b35a6c8
AMDGPU: Add v_prng_b32 instruction for gfx950 (#116310)
Random number instruction for stochastic rounding.
2024-11-18 10:54:54 -08:00
Jay Foad
85c17e4092
[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706)
Convert many instances of:
  Fn = Intrinsic::getOrInsertDeclaration(...);
  CreateCall(Fn, ...)
to the equivalent CreateIntrinsic call.
2024-10-17 16:20:43 +01:00
Rahul Joshi
fa789dffb1
[NFC] Rename Intrinsic::getDeclaration to getOrInsertDeclaration (#111752)
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
2024-10-11 05:26:03 -07:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Jay Foad
d2d947b7e2
[AMDGPU] Fold llvm.amdgcn.cvt.pkrtz when either operand is fpext (#108237)
This also generalizes the Undef handling and adds Poison handling.
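A sketch of the two-fpext case: the f32-to-f16 round-toward-zero
conversion of a value that was just extended from half is exact, so
the packed result can be built from the original halves
(illustrative IR):
```
%xe = fpext half %x to float
%ye = fpext half %y to float
%p  = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %xe, float %ye)
; -> %p can be assembled directly from %x and %y, since
;    converting an fpext'ed half back to half is lossless
```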
2024-09-18 09:37:04 +01:00
Jay Foad
ff7eb1d0e9
[AMDGPU] Simplify API of matchFPExtFromF16. NFC. (#108223) 2024-09-11 17:03:27 +01:00
Jay Foad
f142f8afe2
[AMDGPU] Improve uniform argument handling in InstCombineIntrinsic (#105812)
Common up handling of intrinsics that are a no-op on uniform arguments.
This catches a couple of new cases:

readlane (readlane x, y), z -> readlane x, y
(for any z, does not have to equal y).

permlane64 (readfirstlane x) -> readfirstlane x
(and likewise for any other uniform argument to permlane64).
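In IR terms, the readlane case looks roughly like this (overloaded
.i32 spelling assumed):
```
%inner = call i32 @llvm.amdgcn.readlane.i32(i32 %x, i32 %y)
%outer = call i32 @llvm.amdgcn.readlane.i32(i32 %inner, i32 %z)
; -> %outer is replaced by %inner: %inner is already uniform,
;    so reading it from any lane %z yields the same value
```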
2024-08-23 14:43:31 +01:00
Changpeng Fang
06ab30b574
[AMDGPU] Constant folding of llvm.amdgcn.trig.preop (#98562)
If the parameters (the input and the segment select) coming into the
amdgcn.trig.preop intrinsic are compile-time constants, we pre-compute
the output of amdgcn.trig.preop on the CPU and replace the uses with
the computed constant.

This work extends the patch https://reviews.llvm.org/D120150 to provide
complete coverage.

For the segment select, only src1[4:0] are used. A segment select is
invalid if we are selecting the 53-bit segment beyond the [1200:0] range
of the 2/PI table. 0 is returned when a segment select is not valid.
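A sketch of a foldable call (the folded double constant is computed
on the host and not shown here):
```
; both the input and the segment select are compile-time constants,
; so the 2/PI table selection can be evaluated at compile time
%r = call double @llvm.amdgcn.trig.preop.f64(double 1.000000e+10, i32 0)
```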
2024-07-18 09:40:37 -07:00
Jay Foad
18ec885a26
[RFC][AMDGPU] Remove old llvm.amdgcn.buffer.* and tbuffer intrinsics (#93801)
They have been superseded by llvm.amdgcn.raw.buffer.* and
llvm.amdgcn.struct.buffer.*.
2024-06-10 12:14:51 +01:00
Eli Friedman
f893dccbba
Replace uses of ConstantExpr::getCompare. (#91558)
Use ICmpInst::compare() where possible, ConstantFoldCompareInstOperands
in other places. This only changes places where either the fold is
guaranteed to succeed, or the code doesn't use the resulting compare if
we fail to fold.
2024-05-09 16:50:01 -07:00
Artem Tyurin
141145232f
[IRBuilder] Fold binary intrinsics (#80743)
Fixes https://github.com/llvm/llvm-project/issues/61240.
2024-03-15 09:58:25 +01:00
Yingwei Zheng
930996e9e4
[ValueTracking][NFC] Pass SimplifyQuery to computeKnownFPClass family (#80657)
This patch refactors the interface of the `computeKnownFPClass` family
to pass `SimplifyQuery` directly.
The motivation of this patch is to compute known fpclass with
`DomConditionCache`, which was introduced by
https://github.com/llvm/llvm-project/pull/73662. With
`DomConditionCache`, we can do more optimization with context-sensitive
information.

Example (extracted from
[fmt/format.h](e17bc67547/include/fmt/format.h (L3555-L3566))):
```
define float @test(float %x, i1 %cond) {
  %i32 = bitcast float %x to i32
  %cmp = icmp slt i32 %i32, 0
  br i1 %cmp, label %if.then1, label %if.else

if.then1:
  %fneg = fneg float %x
  br label %if.end

if.else:
  br i1 %cond, label %if.then2, label %if.end

if.then2:
  br label %if.end

if.end:
  %value = phi float [ %fneg, %if.then1 ], [ %x, %if.then2 ], [ %x, %if.else ]
  %ret = call float @llvm.fabs.f32(float %value)
  ret float %ret
}
```
We can prove the signbit of `%value` is always zero. Then the fabs can
be eliminated.
2024-02-06 02:30:12 +08:00
Valery Pykhtin
b8025d1482
Reapply "[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode." (#80303)
Reapply #71556 with added lit test constraint: `REQUIRES: amdgpu-registered-target`.

This reverts commit 9791e5414960f92396582b9e9ee503ac15799312.
2024-02-02 13:09:25 +01:00
Matt Arsenault
65f486c45d AMDGPU: Simplify else if to just else in AMDGPUInstCombineIntrinsic
Fixes #79738
2024-01-30 08:17:03 +05:30
Valery Pykhtin
9791e54149
Revert "[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode." (#78429)
Reverts llvm/llvm-project#71556

Fixes failures:
https://lab.llvm.org/buildbot/#/builders/188/builds/40541
https://lab.llvm.org/buildbot/#/builders/91/builds/21847
https://lab.llvm.org/buildbot/#/builders/98/builds/31671
https://lab.llvm.org/buildbot/#/builders/139/builds/57289
2024-01-17 14:12:07 +01:00
Valery Pykhtin
57b50ef017
[AMDGPU] Add InstCombine rule for ballot.i64 intrinsic in wave32 mode. (#71556)
Substitute it with the ballot.i32 intrinsic zero-extended to i64.
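In IR, the rewrite is (sketch):
```
%b = call i64 @llvm.amdgcn.ballot.i64(i1 %cond)
; -> in wave32 mode:
%b32 = call i32 @llvm.amdgcn.ballot.i32(i1 %cond)
%b   = zext i32 %b32 to i64
```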
2024-01-17 17:02:05 +07:00
Mariusz Sikora
2b83ceee3d
[AMDGPU][GFX12] Default component broadcast store (#76212)
For image and buffer stores the default behaviour on GFX12 is to set all
unset components to the value of the first component. So if we pass only
the X component, it will be the same as XXXX, or XY the same as XYXX.

This patch simplifies the passed vector of components in InstCombine by
removing components from the end that are equal to the first component.

For image stores it also trims DMask if necessary.
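A hedged sketch using the raw buffer store form (exact signatures
assumed):
```
; <x, y, x, x>: the trailing components equal the first one, so on
; GFX12 they match the hardware's default broadcast and can be dropped
call void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float> %xyxx,
    <4 x i32> %rsrc, i32 0, i32 0, i32 0)
; -> call void @llvm.amdgcn.raw.buffer.store.v2f32(<2 x float> %xy, ...)
```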

---------

Co-authored-by: Mateja Marjanovic <mmarjano@amd.com>
2024-01-12 08:26:08 +01:00
Nikita Popov
9d60e95bcd
[AMDGPU] Use poison instead of undef for non-demanded elements (#75914)
Return poison instead of undef for non-demanded lanes in the AMDGPU
demanded element simplification hook.

Also bail out if dmask is 0, as this case has special semantics:

> If DMASK==0, the TA overrides DMASK=1 and puts zeros in VGPR followed by
> LWE status if exists. TFE status is not generated since the fetch is dropped.
2023-12-20 11:01:59 +01:00
Mariusz Sikora
966416b9e8
[AMDGPU][GFX12] Add new v_permlane16 variants (#75475) 2023-12-15 10:14:38 +01:00
Nikita Popov
bc7ca9170f [AMDGPUInstCombine] Avoid use of ConstantExpr::getSExt() (NFC)
Let the IRBuilder handle the constant folding instead.
2023-10-02 11:12:04 +02:00
Matt Arsenault
edecb60481 Reapply "AMDGPU: Drop and auto-upgrade llvm.amdgcn.ldexp to llvm.ldexp"
This reverts commit d9333e360a7c52587ab6e4328e7493b357fb2cf3.
2023-09-13 08:38:48 +03:00
Matt Arsenault
61c8af6792 AMDGPU: InstCombine amdgcn.sqrt.f16 to sqrt.f16
There's nothing special about f16 sqrt handling.
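The fold, sketched in IR:
```
%r = call half @llvm.amdgcn.sqrt.f16(half %x)
; -> %r = call half @llvm.sqrt.f16(half %x)
```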

https://reviews.llvm.org/D158090
2023-08-23 20:30:40 -04:00
Matt Arsenault
7c4aa3b37e AMDGPU: InstCombine amdgcn.rcp(amdgcn.sqrt) -> amdgcn.rsq
We currently have some wrong combines in the backend that
approximately do this.
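Sketched in IR (any fast-math-flag requirements are elided):
```
%s = call float @llvm.amdgcn.sqrt.f32(float %x)
%r = call float @llvm.amdgcn.rcp.f32(float %s)
; -> %r = call float @llvm.amdgcn.rsq.f32(float %x)
```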

https://reviews.llvm.org/D158002
2023-08-16 10:04:13 -04:00
Matt Arsenault
5ccfc4543d AMDGPU: Fold away mbcnt.hi in wave32 mode
This will allow libraries to drop some of the special casing based on
wave size.
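In the common lane-id idiom this looks like (sketch):
```
%lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%id = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
; -> in wave32 there is no high half, so %id is replaced by %lo
```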
2023-06-30 15:04:03 -04:00
Matt Arsenault
d9333e360a Revert "AMDGPU: Drop and auto-upgrade llvm.amdgcn.ldexp to llvm.ldexp"
This reverts commit 1159c670d40e3ef302264c681fe7e0268a550874.

Accidentally pushed wrong patch
2023-06-16 18:13:07 -04:00
Matt Arsenault
1159c670d4 AMDGPU: Drop and auto-upgrade llvm.amdgcn.ldexp to llvm.ldexp 2023-06-16 18:06:27 -04:00
Jay Foad
84313162bf [AMDGPU] Stop replacing amdgcn.ballot(1) with amdgcn.s.getreg(exec)
Rationale:
- It does not enable any further IR simplifications.
- It does not improve the generated code since the isel lowering of
  ballot also has special cases for 0 and 1.
- getreg is "too powerful" since it can read from many different
  registers, so its intrinsic properties have to be set very
  conservatively.

There is also a correctness problem that getreg can read from exec but
it is currently not marked as convergent.

Differential Revision: https://reviews.llvm.org/D153047
2023-06-16 17:15:52 +01:00
Mateja Marjanovic
7047cb5203 [AMDGPU] Trim trailing undefs from the end of image and buffer store
Remove undef values from the end of the vector operand in image and
buffer store instructions.
Also, instead of a call to computeKnownFPClass, use only findScalarElement.

Continuation of: 88421ea973916e Trim zero components from buffer and image stores

Differential Revision: https://reviews.llvm.org/D152440
2023-06-15 15:19:36 +02:00
Matt Arsenault
c6aaa0b14f AMDGPU: Perform basic folds on llvm.amdgcn.exp2 2023-06-15 07:01:06 -04:00
Matt Arsenault
10717f9294 AMDGPU: Add basic folds for llvm.amdgcn.log 2023-06-12 21:10:30 -04:00
Krzysztof Drewniak
faa2c678aa [AMDGPU] Add buffer intrinsics that take resources as pointers
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.

The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.

One advantage of these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.

In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.

This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.

Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.
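For illustration, the raw load in both forms (the pointer-based
spelling follows the naming rule above and is an assumption):
```
; vector-of-i32 resource:
%v = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %rsrc,
    i32 %off, i32 0, i32 0)
; pointer-based resource:
%p = call float @llvm.amdgcn.raw.buffer.ptr.load.f32(ptr addrspace(8) %rp,
    i32 %off, i32 0, i32 0)
```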

Depends on D145441

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D147547
2023-06-05 16:59:07 +00:00
Mateja Marjanovic
c91246b71e fix failures caused by https://reviews.llvm.org/D146737
buildbot: https://lab.llvm.org/buildbot/#/builders/77/builds/27340
2023-06-05 13:10:27 +02:00
Mateja Marjanovic
88421ea973 [AMDGPU] Trim zero components from buffer and image stores
For image and buffer stores the default behaviour on GFX11 and
older is to set all unset components to zero. So if we pass
only the X component, it will be the same as X000, or XY the same as XY00.

This patch simplifies the passed vector of components in InstCombine
by removing zero components from the end.

For image stores it also trims DMask if necessary.
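Sketch (same raw buffer store form as assumed above):
```
; <x, y, 0, 0>: on GFX11 and older the trailing zeros match the
; hardware default, so only <x, y> needs to be stored
call void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float> %xy00,
    <4 x i32> %rsrc, i32 0, i32 0, i32 0)
; -> call void @llvm.amdgcn.raw.buffer.store.v2f32(<2 x float> %xy, ...)
```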

Reviewed by: arsenm, foad, nhaehnle, piotr
2023-06-05 12:30:21 +02:00
Haojian Wu
55635433a8 Fix isKnownNeverInfOrNaN() call in AMDGPU after ORE removal 97b5cc214aee48e30391bfcd2cde4252163d7406 2023-06-02 09:32:46 +02:00
Matt Arsenault
8609df7c6e AMDGPU: Refine undef handling for llvm.amdgcn.class intrinsic
This barely matters since 99% are converted to the generic intrinsic now,
and the only real difference is that the target intrinsic supports a variable
test mask. Start propagating poison. Prefer folding to a defined result (false)
for an undef test mask. Propagate undef for the first operand.
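The refined handling, sketched:
```
%a = call i1 @llvm.amdgcn.class.f32(float %x, i32 undef)
; -> i1 false  (prefer a defined result for an undef test mask)
%b = call i1 @llvm.amdgcn.class.f32(float undef, i32 3)
; -> undef    (undef propagates from the value operand)
```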
2023-06-01 18:35:55 -04:00
Matt Arsenault
9ef1333bf4 AMDGPU: Replace certain llvm.amdgcn.class uses with llvm.is.fpclass
Most transforms should now be performed on llvm.is.fpclass. Unlike the
generic intrinsic, this supports variable test masks.
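For example, with a constant test mask (mask 3 tests for any NaN;
the two intrinsics share the same bit layout):
```
%c = call i1 @llvm.amdgcn.class.f32(float %x, i32 3)
; -> %c = call i1 @llvm.is.fpclass.f32(float %x, i32 3)
```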
2023-05-24 21:49:52 +01:00
Mateja Marjanovic
9c8c31eea4 Revert "[AMDGPU] Trim zero components from buffer and image stores"
This reverts commit 3181a6e3e7dae9292782216a55c5e1f0583c1668.
2023-05-18 17:02:01 +02:00
Matt Arsenault
8f3e64624c AMDGPU: Fold fmed3 of fpext sources to f16 fmed3
InstCombine already does this for minnum/maxnum. If we
also apply this to fmed3, we don't need to explicitly
use 16-bit fmed3 if we're not sure the target
supports 16-bit instructions yet.
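Sketch of the fold:
```
%a = fpext half %x to float
%b = fpext half %y to float
%c = fpext half %z to float
%m = call float @llvm.amdgcn.fmed3.f32(float %a, float %b, float %c)
; -> on targets with 16-bit instructions:
%m16 = call half @llvm.amdgcn.fmed3.f16(half %x, half %y, half %z)
%m   = fpext half %m16 to float
```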
2023-05-18 08:34:46 +01:00
Matt Arsenault
86d0b524f3 ValueTracking: Expand signature of isKnownNeverInfinity/NaN
This is in preparation for replacing the implementation
with a wrapper around computeKnownFPClass.
2023-05-16 20:42:58 +01:00
Mateja Marjanovic
3181a6e3e7 [AMDGPU] Trim zero components from buffer and image stores
For image and buffer stores the default behaviour on GFX11 and
older is to set all unset components to zero. So if we pass
only the X component, it will be the same as X000, or XY the same as XY00.

This patch simplifies the passed vector of components in InstCombine
by removing zero components from the end.

For image stores it also trims DMask if necessary.

Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D146737
2023-05-15 18:23:27 +02:00
Matt Arsenault
1bce1beac4 AMDGPU: Reduce number of calls to computeKnownFPClass and pass all arguments
Makes assumes work for this case.
2023-04-26 13:02:17 -04:00
Craig Topper
1f60c8d025 [IR] Replace calls to ConstantFP::getNullValue with ConstantFP::getZero. NFC
There is no getNullValue in ConstantFP. Due to inheritance, we're calling
Constant::getNullValue which handles any type including FP.
Since we already know we want an FP constant we can use ConstantFP::getZero
which might be faster and is a more readable name for an FP zero.
2023-04-03 23:14:02 -07:00
Kazu Hirata
f8f3db2756 Use APInt::count{l,r}_{zero,one} (NFC) 2023-02-19 22:04:47 -08:00
Kazu Hirata
cbde2124f1 Use APInt::popcount instead of APInt::countPopulation (NFC)
This is for consistency with the C++20-style bit manipulation
functions in <bit>.
2023-02-19 11:29:12 -08:00