The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.
This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.
Fixes: SWDEV-487672, SWDEV-487669
This avoids regressions in a future AMDGPU commit. Previously we
would have a build_vector (extract_vector_elt x), undef with free
access to the elements bloated into a shuffle of one element + undef,
which has much worse combine support than the extract.
Alternatively could check aggressivelyPreferBuildVectorSources, but
I'm not sure it's really different than isExtractVecEltCheap.
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, which are just use-different-register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
This avoids some of the pending regressions after AMDGPU implements
isExtractVecEltCheap.
In a case like shl <value, undef>, splat k, because the second operand
was fully defined, we would fall through and use the splat value for the
first operand, losing the undef high bits. This would result in an additional
instruction to handle the high bits. Add some reduced testcases for different
opcodes for one of the regressions.
The motivation for this is to allow us to match strided accesses that
are emitted from the loop vectorizer with EVL tail folding (see #122232).
In these loops the step isn't loop invariant and is based off of
@llvm.experimental.get.vector.length.
We can relax this as long as we make sure to construct the updates after
the definition inside the loop, instead of the preheader.
I presume the restriction was previously added so that the step would
dominate the insertion point in the preheader. I can't think of why it
wouldn't be safe to calculate it in the loop otherwise.
I have a particular user downstream who likes to write shuffles in terms
of unions involving _BitInt(128) types. This isn't completely crazy
because there's a bunch of code in the wild which was written with SSE
in mind, so 128 bits is a common data fragment size.
The problem is that generic lowering scalarizes this to ELEN, and we end
up with really terrible extract/insert sequences if the i128 shuffle is
between other (non-i128) operations.
I explored trying to do this via generic lowering infrastructure, and
frankly got lost. Doing this as a target specific DAG combine is a bit ugly -
really, there's nothing hugely target specific here - but oh well. If
reviewers prefer, I could probably phrase this as a generic DAG combine,
but I'm not sure that's hugely better. If reviewers have a strong
preference on how to handle this, let me know, but I may need a bit of
help.
A couple notes:
* The argument passing weirdness is due to a missing combine to turn a
build_vector of adjacent i64 loads back into a vector load. I'm a bit
surprised we don't get that, but the isel output clearly has the
build_vector at i64.
* The splat case I plan to revisit in another patch. That's a relatively
common pattern, and the fact I have to scalarize that to avoid an
infinite loop is non-ideal.
Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and
fake16 flow.
Since we are replacing `v_cndmask_b16` with `v_cndmask_b16_t16/fake16`, we
have to at least update the fake16 CodeGen to get the CodeGen tests passing.
For this case, we have to update true16 and fake16 together,
otherwise some of the true16 tests will fail.
- Add support for `@llvm.exp2()`:
- LLVM: `float` -> PTX: `ex2.approx{.ftz}.f32`
- LLVM: `half` -> PTX: `ex2.approx.f16`
- LLVM: `<2 x half>` -> PTX: `ex2.approx.f16x2`
- LLVM: `bfloat` -> PTX: `ex2.approx.ftz.bf16`
- LLVM: `<2 x bfloat>` -> PTX: `ex2.approx.ftz.bf16x2`
- Any operations with non-native vector widths are expanded. On
targets not supporting f16/bf16, values are promoted to f32.
- Add *CONDITIONAL* support for `@llvm.log2()` [^1]:
- LLVM: `float` -> PTX: `lg2.approx{.ftz}.f32`
- Support for f16/bf16 is emulated by promoting values to f32.
[^1]: CUDA implements `exp2()` with `ex2.approx` but `log2()` is
implemented differently, so this is off by default. To enable, use the
flag `-nvptx-approx-log2f32`.
```
- add clang builtin to Builtins.td
- link builtin in hlsl_intrinsics
- add codegen for spirv intrinsic and two directx intrinsics to retain
signedness information of the operands in CGBuiltin.cpp
- add semantic analysis in SemaHLSL.cpp
- add lowering of spirv intrinsic to spirv backend in
SPIRVInstructionSelector.cpp
- add lowering of directx intrinsics to WaveActiveOp dxil op in
DXIL.td
- add test cases to illustrate passespendent pr merges.
```
Resolves #70106
---------
Co-authored-by: Finn Plummer <canadienfinn@gmail.com>
For .wv widening instructions when checking if the operand is vs1 or
vs2, we take into account whether or not it has a passthru. For tied
pseudos though their passthru is the vs2, and we weren't taking this
into account.
SM80 has fma for bfloat16 but not add/mul/sub. Currently these ops incur
a promotion to f32, but we can avoid this by writing them in terms of
the fma:
```
FADD(a, b) -> FMA(a, 1.0, b)
FMUL(a, b) -> FMA(a, b, -0.0)
FSUB(a, b) -> FMA(b, -1.0, a)
```
Unfortunately there is no `fma.ftz` so when ftz is enabled, we still
fall back to promotion.
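
As a quick sanity check of the rewrites above, here is a minimal sketch on
plain `float` using `std::fma`. The helper names are hypothetical and this is
not the NVPTX lowering code itself. The `-0.0` addend in the FMUL case matters
because adding `+0.0` would turn a `-0.0` product into `+0.0`.

```cpp
#include <cmath>

// Hypothetical helpers that mirror the three rewrites on float.
// FADD(a, b) -> FMA(a, 1.0, b): a * 1.0 is exact, so only the add rounds.
float fadd_via_fma(float a, float b) { return std::fma(a, 1.0f, b); }

// FMUL(a, b) -> FMA(a, b, -0.0): adding -0.0 keeps the sign of a zero product.
float fmul_via_fma(float a, float b) { return std::fma(a, b, -0.0f); }

// FSUB(a, b) -> FMA(b, -1.0, a): b * -1.0 is exact, giving a - b.
float fsub_via_fma(float a, float b) { return std::fma(b, -1.0f, a); }
```

For example, `fmul_via_fma(-1.0f, 0.0f)` keeps the negative zero that a plain
multiply would produce.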
This PR fixes `--target-env` version value in the test case
`llvm/test/CodeGen/SPIRV/validate/sycl-tangle-group-algorithms.ll`: the
issue was introduced in https://github.com/llvm/llvm-project/pull/122755
When processing an instruction with a register mask, the machine copy
propagation pass was dropping the information about some copy
instructions which define a register which is preserved by the mask,
because that register overlaps a register which is partially clobbered
by it. This resulted in a miscompilation for AArch64, because this
caused a live copy to be considered dead.
The fix is to clobber register masks by finding the set of reg units
which is preserved by the mask, and clobbering all units not in that
set.
This is based on #122472, and fixes the compile time performance
regressions which were caused by that.
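
The reg-unit based clobbering described above can be illustrated with a toy
model. Everything here is an assumption made for the sketch: the register
names, the unit numbering, and the helper functions are hypothetical, and only
copy destinations are tracked, which is much simpler than the real pass.

```cpp
#include <map>
#include <set>
#include <string>

using RegUnit = int;
using Reg = std::string;

// Toy register file (hypothetical): D8 is the low half of Q8, so both
// registers contain unit 0, and Q8 additionally contains unit 1.
const std::map<Reg, std::set<RegUnit>> RegUnits = {
    {"D8", {0}}, {"Q8", {0, 1}}, {"D0", {2}}, {"Q0", {2, 3}}};

// The units preserved by a mask are the union of the units of every
// register the mask preserves.
std::set<RegUnit> preservedUnits(const std::set<Reg> &PreservedRegs) {
  std::set<RegUnit> Units;
  for (const Reg &R : PreservedRegs) {
    const std::set<RegUnit> &U = RegUnits.at(R);
    Units.insert(U.begin(), U.end());
  }
  return Units;
}

// A tracked copy (keyed by the register it defines) stays valid only if
// none of its units falls outside the preserved set.
std::set<Reg> survivingCopies(const std::set<Reg> &CopyDefs,
                              const std::set<Reg> &PreservedRegs) {
  const std::set<RegUnit> Keep = preservedUnits(PreservedRegs);
  std::set<Reg> Live;
  for (const Reg &Def : CopyDefs) {
    bool AllPreserved = true;
    for (RegUnit U : RegUnits.at(Def))
      if (Keep.count(U) == 0)
        AllPreserved = false;
    if (AllPreserved)
      Live.insert(Def);
  }
  return Live;
}
```

In this model a mask that preserves only `D8` keeps a copy into `D8` alive even
though the overlapping `Q8` is partially clobbered, while copies into `Q8` and
`D0` are dropped.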
When generating a constant vector, if `UseSplat` is false, the indices
different from the index of the extract can be filled with `poison`
instead of `undef`.
Closes https://github.com/llvm/llvm-project/issues/99116
Implements `firstbitlow` by extracting common functionality from
`firstbithigh` into a shared function while also fixing a bug for an edge
case where `u64x3` and larger vectors will attempt to create vectors
larger than the SPIR-V max of 4.
---------
Co-authored-by: Steven Perron <stevenperron@google.com>
These were introduced in 1213a7a57fdc (Hexagon backend support,
2011-12-12) but they aren't present in libclangrt.builtins-hexagon. The
generic versions of these functions are present in the builtins, though.
So it should suffice to call those instead.
I noticed that NVPTX will sometimes emit `mad.lo` to multiply by 1, e.g.
in https://gcc.godbolt.org/z/4j47Y9W4c.
This happens when DAGCombiner operates on the add before the mul, so the
imad contraction happens regardless of whether the mul could have been
simplified.
To fix this, I remove `NVPTXISD::IMAD` and only combine to mad during
selection. This allows the default DAGCombiner patterns to simplify
the graph without any NVPTX-specific intervention.
APInt will fail when given a negative offset. SelectScratchSVAddr
utilizes this function and can be given a negative offset as well, so
this change modifies it to use APSInt instead.
This change adds IntrConvergent property to image.getlod intrinsic and
to several image.sample intrinsics. All image.sample intrinsics apart
from LOD(_L), Level 0(_LZ), Derivative(_D) will be marked as Convergent.
None of the changes in #122232 or the upcoming #122244 are specific to
the EVL, so split out the EVL tail-folded loops into separate
"integration tests" that reflect the output of the loop vectorizer.
Preparation for #121124
This PR pre-commits the tests from
[PR](https://github.com/llvm/llvm-project/pull/121124), which adds
selection patterns for the `v_sat_pk` instruction, so that the change in
test output before and after that commit is visible.
Pre-commit tests PR for #121124 : Add selection patterns for instruction
`v_sat_pk`
Minor followup to #122485 - if the source operands were widened half-size subvectors, then attempt to concatenate the subvectors directly, and then adjust the shuffle mask so references to the second operand now refer to the upper half of the concat result.
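
The mask adjustment can be sketched as below. This is a simplified model and
not the actual combine: the helper name is hypothetical, each operand is
assumed to hold its half-size subvector in the low half, and any reference to
the widened padding in an upper half is assumed to become undef (`-1`).

```cpp
#include <vector>

// Hypothetical helper. Mask indexes two NumElts-wide operands: values in
// [0, NumElts) pick from the first, [NumElts, 2 * NumElts) from the second.
// The result indexes concat(sub0, sub1), which is NumElts wide.
std::vector<int> remapMaskForConcat(const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  const int Half = NumElts / 2;
  std::vector<int> NewMask;
  NewMask.reserve(Mask.size());
  for (int M : Mask) {
    if (M < 0) { // already undef
      NewMask.push_back(-1);
      continue;
    }
    const bool FromSecond = M >= NumElts;
    const int Idx = FromSecond ? M - NumElts : M;
    if (Idx >= Half) { // points at widened padding, treated as undef here
      NewMask.push_back(-1);
      continue;
    }
    // Second-operand elements now live in the upper half of the concat.
    NewMask.push_back(FromSecond ? Half + Idx : Idx);
  }
  return NewMask;
}
```

With 4-element operands, the interleaving mask `{0, 4, 1, 5}` becomes
`{0, 2, 1, 3}` over the concatenated vector.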
Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill
lowering. This has the effect of AND'ing demote/kill condition with exec
which is needed for proper live mask update.
The S_XOR is inadequate because it may return true for a lane with exec=0.
This patch fixes image corruption in a game.
I think the issue went unnoticed because demote/kill condition is often
naturally dependent on exec, so AND'ing with exec is usually not
required.
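
A small lane-mask model shows the difference. It assumes that a set condition
bit means the lane stays alive and that the old code XOR'ed the condition with
exec; the exact operands in the lowering may differ. The function names are
hypothetical; only the `S0 & ~S1` semantics of S_ANDN2 are taken as given.

```cpp
#include <cstdint>

using LaneMask = std::uint64_t; // one bit per lane, as in a wave64 exec mask

// Assumed model of the old computation: XOR of exec with the condition.
// An inactive lane (exec bit 0) whose condition bit is 1 still ends up set.
LaneMask killMaskXor(LaneMask Exec, LaneMask Cond) { return Exec ^ Cond; }

// S_ANDN2 computes S0 & ~S1, so the result is always a subset of exec.
LaneMask killMaskAndn2(LaneMask Exec, LaneMask Cond) { return Exec & ~Cond; }
```

With `Exec = 0x3` and `Cond = 0x5`, the XOR form yields `0x6` and flags the
inactive lane 2, while the ANDN2 form yields `0x2`, only the active lane whose
condition is false.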