This might be a copy/paste error. I don't think this is an issue in
practice, as the builtins/intrinsics are only legal with identical vector
element types.
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.
This is a preparatory step in migrating the codebase over to
CmpPredicate. Since no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
The main goal is to fold away wave64 code when compiled for wave32.
If we have out-of-bounds indexing, these will now clamp down to
a low bit, which may CSE with the operations on the low half of the
wave.
Currently the builtins assume you are using an 8-bit format that requires
an 8-element vector. We can shrink the number of registers if the format
requires 4 or 6.
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).
I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.
The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation for
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e., just lookup, no creation).
Common up handling of intrinsics that are a no-op on uniform arguments.
This catches a couple of new cases:
readlane (readlane x, y), z -> readlane x, y
(for any z, does not have to equal y).
permlane64 (readfirstlane x) -> readfirstlane x
(and likewise for any other uniform argument to permlane64).
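A minimal IR sketch of the two folds (using the legacy i32-only intrinsic
spellings; function names are illustrative):
```
declare i32 @llvm.amdgcn.readlane(i32, i32)
declare i32 @llvm.amdgcn.readfirstlane(i32)
declare i32 @llvm.amdgcn.permlane64(i32)

define i32 @readlane_of_readlane(i32 %x, i32 %y, i32 %z) {
  ; %inner is already uniform, so the outer readlane is a no-op:
  ; folds to readlane(%x, %y) for any %z.
  %inner = call i32 @llvm.amdgcn.readlane(i32 %x, i32 %y)
  %outer = call i32 @llvm.amdgcn.readlane(i32 %inner, i32 %z)
  ret i32 %outer
}

define i32 @permlane64_of_readfirstlane(i32 %x) {
  ; A uniform value is identical in both wave halves, so the
  ; permlane64 folds away.
  %uni = call i32 @llvm.amdgcn.readfirstlane(i32 %x)
  %swap = call i32 @llvm.amdgcn.permlane64(i32 %uni)
  ret i32 %swap
}
```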
If the parameters (the input and segment select) coming into the
amdgcn.trig.preop intrinsic are compile-time constants, we pre-compute
the output of amdgcn.trig.preop on the CPU and replace the uses with
the computed constant.
This work extends the patch https://reviews.llvm.org/D120150 to provide
complete coverage.
For the segment select, only src1[4:0] are used. A segment select is
invalid if we are selecting the 53-bit segment beyond the [1200:0] range
of the 2/PI table. 0 is returned when a segment select is not valid.
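A sketch of the shape of the fold (the constant operands are illustrative):
```
declare double @llvm.amdgcn.trig.preop.f64(double, i32)

define double @fold_trig_preop() {
  ; Both the input and the segment select are compile-time constants, so
  ; the selected 53-bit segment of the 2/PI table can be computed on the
  ; host and this call replaced with a double constant (or +0.0 if the
  ; segment select is invalid).
  %r = call double @llvm.amdgcn.trig.preop.f64(double 1.0e+10, i32 3)
  ret double %r
}
```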
Use ICmpInst::compare() where possible, ConstantFoldCompareInstOperands
in other places. This only changes places where either the fold is
guaranteed to succeed, or the code doesn't use the resulting compare if
we fail to fold.
This patch refactors the interface of the `computeKnownFPClass` family
to pass `SimplifyQuery` directly.
The motivation of this patch is to compute known fpclass with
`DomConditionCache`, which was introduced by
https://github.com/llvm/llvm-project/pull/73662. With
`DomConditionCache`, we can do more optimization with context-sensitive
information.
Example (extracted from
[fmt/format.h](e17bc67547/include/fmt/format.h (L3555-L3566))):
```
define float @test(float %x, i1 %cond) {
  %i32 = bitcast float %x to i32
  %cmp = icmp slt i32 %i32, 0
  br i1 %cmp, label %if.then1, label %if.else

if.then1:
  %fneg = fneg float %x
  br label %if.end

if.else:
  br i1 %cond, label %if.then2, label %if.end

if.then2:
  br label %if.end

if.end:
  %value = phi float [ %fneg, %if.then1 ], [ %x, %if.then2 ], [ %x, %if.else ]
  %ret = call float @llvm.fabs.f32(float %value)
  ret float %ret
}
```
We can prove the signbit of `%value` is always zero. Then the fabs can
be eliminated.
For image and buffer stores the default behaviour on GFX12 is to set all
unset components to the value of the first component. So if we pass only
the X component, it will be the same as XXXX, or XY the same as XYXX.
This patch simplifies the passed vector of components in InstCombine by
removing components from the end that are equal to the first component.
For image stores it also trims DMask if necessary.
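A sketch of the fold on a raw buffer store (image stores get the same
treatment plus the DMask adjustment):
```
declare void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i32)
declare void @llvm.amdgcn.raw.buffer.store.v2f32(<2 x float>, <4 x i32>, i32, i32, i32)

; Before: Z and W repeat the first component.
define amdgpu_ps void @before(<4 x i32> inreg %rsrc, float %x, float %y, i32 %off) {
  %v0 = insertelement <4 x float> poison, float %x, i64 0
  %v1 = insertelement <4 x float> %v0, float %y, i64 1
  %v2 = insertelement <4 x float> %v1, float %x, i64 2
  %v3 = insertelement <4 x float> %v2, float %x, i64 3
  call void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float> %v3, <4 x i32> %rsrc, i32 %off, i32 0, i32 0)
  ret void
}

; After: on GFX12 the hardware fills the unset components with X anyway.
define amdgpu_ps void @after(<4 x i32> inreg %rsrc, float %x, float %y, i32 %off) {
  %v0 = insertelement <2 x float> poison, float %x, i64 0
  %v1 = insertelement <2 x float> %v0, float %y, i64 1
  call void @llvm.amdgcn.raw.buffer.store.v2f32(<2 x float> %v1, <4 x i32> %rsrc, i32 %off, i32 0, i32 0)
  ret void
}
```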
---------
Co-authored-by: Mateja Marjanovic <mmarjano@amd.com>
Return poison instead of undef for non-demanded lanes in the AMDGPU
demanded element simplification hook.
Also bail out if dmask is 0, as this case has special semantics:
> If DMASK==0, the TA overrides DMASK=1 and puts zeros in VGPR followed by
> LWE status if exists. TFE status is not generated since the fetch is dropped.
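A sketch of the kind of call the demanded-lanes hook rewrites (image-sample
signature as in the AMDGPU backend; only lane 0 of the result is used):
```
declare <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32)

define float @sample_only_x(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %s, float %t) {
  ; Only lane 0 is demanded, so the sample is shrunk to a single-channel
  ; dmask=1 variant; any non-demanded lanes are now filled with poison
  ; instead of undef. A dmask of 0 is left alone.
  %v = call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %x = extractelement <4 x float> %v, i64 0
  ret float %x
}
```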
Rationale:
- It does not enable any further IR simplifications.
- It does not improve the generated code since the isel lowering of
ballot also has special cases for 0 and 1.
- getreg is "too powerful" since it can read from many different
registers, so its intrinsic properties have to be set very
conservatively.
There is also a correctness problem: getreg can read from exec, but it
is currently not marked as convergent.
Differential Revision: https://reviews.llvm.org/D153047
Remove undef values from the end of the vector operand in image and
buffer store instructions.
Also, instead of a call to computeKnownFPClass, use only findScalarElement.
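A sketch of the fold, assuming the raw buffer store form:
```
declare void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i32)
declare void @llvm.amdgcn.raw.buffer.store.v3f32(<3 x float>, <4 x i32>, i32, i32, i32)

define amdgpu_ps void @trailing_undef(<4 x i32> inreg %rsrc, float %x, float %y, float %z, i32 %off) {
  ; Element 3 is never written and stays undef, so the store is shrunk
  ; to the v3 variant.
  %v0 = insertelement <4 x float> undef, float %x, i64 0
  %v1 = insertelement <4 x float> %v0, float %y, i64 1
  %v2 = insertelement <4 x float> %v1, float %z, i64 2
  call void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float> %v2, <4 x i32> %rsrc, i32 %off, i32 0, i32 0)
  ret void
}
```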
Continuation of: 88421ea973916e Trim zero components from buffer and image stores
Differential Revision: https://reviews.llvm.org/D152440
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.
The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.
One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.
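A sketch contrasting the two forms, assuming the pointer-based spelling
used in-tree (`llvm.amdgcn.raw.ptr.buffer.load`):
```
; Legacy form: the resource is an opaque <4 x i32>, invisible to alias analysis.
declare float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32>, i32, i32, i32)

; Pointer form: the resource is a ptr addrspace(8), so attributes such as
; noalias apply to it like to any other pointer.
declare float @llvm.amdgcn.raw.ptr.buffer.load.f32(ptr addrspace(8), i32, i32, i32)

define float @load_via_ptr_rsrc(ptr addrspace(8) noalias %rsrc, i32 %off) {
  %v = call float @llvm.amdgcn.raw.ptr.buffer.load.f32(ptr addrspace(8) %rsrc, i32 %off, i32 0, i32 0)
  ret float %v
}
```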
In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.
This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.
Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.
Depends on D145441
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D147547
For image and buffer stores the default behaviour on GFX11 and
older is to set all unset components to zero. So if we pass only
the X component, it will be the same as X000, or XY the same as XY00.
This patch simplifies the passed vector of components in InstCombine
by removing zero components from the end.
For image stores it also trims DMask if necessary.
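A sketch of the fold (raw buffer store shown; image stores additionally
shrink the DMask):
```
declare void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i32)
declare void @llvm.amdgcn.raw.buffer.store.f32(float, <4 x i32>, i32, i32, i32)

define amdgpu_ps void @store_x000(<4 x i32> inreg %rsrc, float %x, i32 %off) {
  ; The trailing components are zero, which is what the hardware writes
  ; for unset components anyway, so this shrinks to a scalar store of %x.
  %v = insertelement <4 x float> zeroinitializer, float %x, i64 0
  call void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float> %v, <4 x i32> %rsrc, i32 %off, i32 0, i32 0)
  ret void
}
```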
Reviewed by: arsenm, foad, nhaehnle, piotr
This barely matters since 99% of uses are converted to the generic intrinsic
now, and the only real difference is that the target intrinsic supports a
variable test mask. Start propagating poison. Prefer folding to a defined
result (false)
for an undef test mask. Propagate undef for the first operand.
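A sketch of the new foldings on llvm.amdgcn.class:
```
declare i1 @llvm.amdgcn.class.f32(float, i32)

define i1 @undef_mask(float %x) {
  ; An undef test mask folds to the defined result: false.
  %r = call i1 @llvm.amdgcn.class.f32(float %x, i32 undef)
  ret i1 %r
}

define i1 @poison_operand(i32 %mask) {
  ; A poison operand now propagates poison.
  %r = call i1 @llvm.amdgcn.class.f32(float poison, i32 %mask)
  ret i1 %r
}
```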
InstCombine already does this for minnum/maxnum. If we
also apply this to fmed3, we don't need to explicitly
use 16-bit fmed3 if we're not sure the target
supports 16-bit instructions yet.
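A sketch of the intended fold, mirroring the existing minnum/maxnum handling:
```
declare float @llvm.amdgcn.fmed3.f32(float, float, float)

define float @fmed3_of_fpext(half %x, half %y, half %z) {
  ; With this fold, InstCombine can shrink the operation to an f16 fmed3
  ; followed by a single fpext when the target supports 16-bit instructions.
  %xe = fpext half %x to float
  %ye = fpext half %y to float
  %ze = fpext half %z to float
  %med = call float @llvm.amdgcn.fmed3.f32(float %xe, float %ye, float %ze)
  ret float %med
}
```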
For image and buffer stores the default behaviour on GFX11 and
older is to set all unset components to zero. So if we pass only
the X component, it will be the same as X000, or XY the same as XY00.
This patch simplifies the passed vector of components in InstCombine
by removing zero components from the end.
For image stores it also trims DMask if necessary.
Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D146737
There is no getNullValue in ConstantFP. Due to inheritance, we're calling
Constant::getNullValue which handles any type including FP.
Since we already know we want an FP constant we can use ConstantFP::getZero
which might be faster and is a more readable name for an FP zero.