`Constant::isZeroValue` currently behaves same as
`Constant::isNullValue` for all types except floating-point, where it
additionally returns true for negative zero (`-0.0`). However, in
practice, almost all callers operate on integer/pointer types where the
two are equivalent, and the few FP-relevant callers have no meaningful
dependence on the `-0.0` behavior.
This PR removes `isZeroValue` to eliminate the confusing API. All
callers are changed to `isNullValue` with no test failures.
`isZeroValue` will be reintroduced in a future change with clearer
semantics: when null pointers may have non-zero bit patterns,
`isZeroValue` will check for bitwise-all-zeros, while `isNullValue` will
check for the semantic null (which
may be non-zero).
Adds logic to detect cases where the llvm.amdgcn.wave.shuffle intrinsic
is being applied to an index operand that would make the result
equivalent to the various Row Share flavors of DPP16 operations, and
replaces the intrinsic and the instructions computing the index with an
equivalent llvm.amdgcn.update.dpp call.
We were folding undef inputs to qnan which is incorrect. The instruction
never returns nan. Out of bounds segment select will return 0, so fold
undef segment to 0.
This commit adds the rewrite
```
llvm.amdgcn.tensor.{load.to/store.from}.lds(
<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer,
<4 x i32> zeroinitializer, i32 [cachepolicy])
=>
llvm.amdgcn.tensor.{load.to/store.from}.lds.d2(
<4 x i32> %$d0, <8 x i32> %d1, i32 [cachepolicy])
```
This is justifed because, when the short encoding that uses the NULL
SGPR for registers 2 and 3 is used, the hardware acts as if those
registers were 0, including in the gather mode.
It is always safe not to run this transformation.
(Note: tests were LLM'd and then tweaked.)
Now we only support D16 folding for `image sample` instructions with a
single user: a `fptrunc` to half.
However, we can actually support D16 folding for image.sample
instructions with multiple users,
as long as each user follows the pattern of extractelement followed by
fptrunc to half.
For example:
```
%sample = call <4 x float> @llvm.amdgcn.image.sample
%e0 = extractelement <4 x float> %sample, i32 0
%h0 = fptrunc float %e0 to half
%e1 = extractelement <4 x float> %sample, i32 1
%h1 = fptrunc float %e1 to half
%e2 = extractelement <4 x float> %sample, i32 2
%h2 = fptrunc float %e2 to half
```
This change enables D16 folding for such cases and avoids generating
`v_cvt_f16_f32_e32` instructions.
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing a easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
When a read(first)lane is used on a binary operator and the intrinsic is
the only user of the operator, we can move the read(first)lane into the
operand if the other operand is uniform.
Unfortunately IC doesn't let us access UniformityAnalysis and thus we
can't truly check uniformity, we have to do with a basic uniformity
check which only allows constants or trivially uniform intrinsics calls.
We can also do the same for unary and cast operators.
By the pseudocode in the ISA manual, if any input is a nan it acts
like min3, which will fold to min2 of the other operands. The other
cases fold to min, I'm not sure how this one was wrong.
Most callers want a constant index. Instead of making every caller
create a ConstantInt, we can do it in IRBuilder. This is similar to
createInsertElement/createExtractElement.
This might be a copy/paste error. I don't think this an issue in
practice as the builtins/intrinsics are only legal with identical vector
element types.
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.
This is a preparatory step in migrating the codebase over to
CmpPredicate. Since we no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
The main goal is to fold away wave64 code when compiled for wave32.
If we have out of bounds indexing, these will now clamp down to
a low bit which may CSE with the operations on the low half of the
wave.
Currently the builtins assume you are using an 8-bit format that requires
an 8 element vector. We can shrink the number of registers if the format
requires 4 or 6.
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).
I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.
The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.