When lowering using sublane shuffles, we can sometimes end up with the
same mask as we started with. We already bail on these occasions, but we
weren't fully simplifying the new shuffle mask before testing if it
matched.
Fixes #153457
Add special handling of EXTRACT_SUBVECTOR, INSERT_SUBVECTOR,
EXTRACT_VECTOR_ELT, INSERT_VECTOR_ELT and SCALAR_TO_VECTOR in
isGuaranteedNotToBeUndefOrPoison. Make use of DemandedElts to improve
the analysis and only check relevant elements for each operand.
Also start using DemandedElts in the recursive calls that check
isGuaranteedNotToBeUndefOrPoison for all operands for operations that do
not create undef/poison. We can do that for a number of elementwise
operations for which the DemandedElts can be applied to every operand
(e.g. ADD, OR, BITREVERSE, TRUNCATE).
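As a rough illustration of the demanded-elements idea, here is a standalone toy model. It is not the SelectionDAG code: `Lanes`, `isNotPoison`, `demandedForExtract` and `elementwiseNotPoison` are hypothetical names, and vectors are assumed to have at most 64 lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: one flag per vector lane, true if that lane may be undef/poison.
using Lanes = std::vector<bool>;

// Returns true if every *demanded* lane is known not to be undef/poison.
// Bit i of `demanded` is set when lane i is actually read by the user.
bool isNotPoison(const Lanes &maybePoison, uint64_t demanded) {
  for (std::size_t i = 0; i < maybePoison.size() && i < 64; ++i)
    if (((demanded >> i) & 1) && maybePoison[i])
      return false;
  return true;
}

// An extract of a constant lane only demands that one lane of its source.
uint64_t demandedForExtract(unsigned index) {
  assert(index < 64 && "toy model supports at most 64 lanes");
  return uint64_t{1} << index;
}

// An elementwise operation forwards the same demanded lanes to every
// operand, so only those lanes of each operand need to be checked.
bool elementwiseNotPoison(const Lanes &lhs, const Lanes &rhs,
                          uint64_t demanded) {
  return isNotPoison(lhs, demanded) && isNotPoison(rhs, demanded);
}
```

In this model, a poison lane that nobody reads no longer makes the whole query fail.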
Depends on #152591 to fix
https://github.com/llvm/llvm-project/issues/149023.
Similar to an EH pad, there is no real advantage in "falling through" to
an indirect target of an INLINEASM_BR. Moreover, multiple indirect targets
of inline asm at the end of a function may be rotated infinitely.
Therefore, this patch avoids choosing an indirect target of inline asm as
the fall-through block.
This patch adds CodeGen support for the qc.insbi and qc.insb instructions
defined in the Qualcomm uC Xqcibm extension. qc.insbi and qc.insb insert
bits into the destination register from an immediate and a register
operand, respectively.
A sequence of `xor`, `and` & `xor` is converted, under appropriate
conditions, to `qc.insbi` or `qc.insb`, depending on the immediate's
value.
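For readers unfamiliar with why an `xor`/`and`/`xor` chain is a bit insert, the generic masked-merge identity can be sketched as below. The exact DAG pattern and operand constraints the patch matches are not shown here, so the function names and the `fieldMask` helper are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Masked merge written as xor, and, xor: take the bits of `src` where
// `mask` is 1 and keep the bits of `dst` everywhere else.
uint32_t insertBitsXorAndXor(uint32_t dst, uint32_t src, uint32_t mask) {
  return ((dst ^ src) & mask) ^ dst;
}

// Straightforward reference formulation of the same bit insert.
uint32_t insertBitsReference(uint32_t dst, uint32_t src, uint32_t mask) {
  return (dst & ~mask) | (src & mask);
}

// Mask for a contiguous field of `width` bits starting at bit `shamt`.
uint32_t fieldMask(unsigned width, unsigned shamt) {
  assert(width >= 1 && shamt < 32 && width + shamt <= 32);
  uint32_t ones = width == 32 ? 0xFFFFFFFFu : ((uint32_t{1} << width) - 1);
  return ones << shamt;
}
```

Both formulations agree for every input, which is what makes the three-instruction sequence a candidate for a single insert instruction when the mask is a contiguous field.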
For a <8 x i32> -> <2 x i128> bitcast, which on AArch64 is split into
two halves, the scalar i128 remainder was causing a crash with invalid
vector types. This makes sure such remainders are handled correctly in
fewerElementsBitcast.
780054d3ff18075a6bc433029f336931792b1d2d added support for
`ISD::AssertNoFPClass`.
This ISD node can be used with the `ppc_fp128` type, which is really
just two `f64`s and requires expanding when used with
`ISD::AssertNoFPClass`. Without the support for expanding the result, we
get an assertion because the legalizer does not know how to expand the
results of `ppc_fp128` with `ISD::AssertNoFPClass`.
```
ExpandFloatResult #0: t7: ppcf128 = AssertNoFPClass t5, TargetConstant:i32<3>
LLVM ERROR: Do not know how to expand the result of this operator!
```
Thus, this patch aims to add support for the expand so we no longer
assert.
This fixes #151375.
Follow-up PR to #153071, adding the remaining zvbb instructions
(VBREV8_V and VREV8_V), plus the zvbc instructions (VCLMUL_VV, VCLMUL_VX,
VCLMULH_VV, VCLMULH_VX).
Godbolt example: https://godbolt.org/z/ThdfP475a
In the example, a single-element vse is used to store the reduction result
instead of a scalar store ([this optimization was introduced by this
patch](https://reviews.llvm.org/D109482)). However, vmv.x.s can't be
eliminated here because it has other uses (e.g. CopyToReg), so it seems
more profitable to use a scalar store (we already have the store value in a
scalar register, and can save one vsetvli, which is likely to be required
for a single-element vse). The proposed solution is to perform this
transform only if vmv.x.s has one use (in the store instruction).
When using the `amdgcn.init.whole.wave` intrinsic, we add dummy VGPR
arguments with the purpose of preserving their inactive lanes. The
pattern may look something like this:
```
entry:
call amdgcn.init.whole.wave
branch to shader or tail
shader:
$vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes
actual code...
tail:
call amdgcn.cs.chain [...], implicit $vInactive
```
We should not report these VGPRs in the `.vgpr_count` metadata. This
patch achieves that goal by ignoring meta instructions and calls. This should
be safe since if those registers are actually used in any other context,
they will be counted there. The same reasoning applies in the general
case, so we don't explicitly check for the existence of `init.whole.wave`.
This is a reworked version of #133242, which was reverted in #144039
and split into smaller bits.
Cause:
1. An `implicit_def` inside a bundle does not count as a definition of the
register in the machine verifier.
2. Including the `implicit_def` leaves the related register undefined, which
results in `Bad machine code: Using an undefined physical register` in the
machine verifier.
Fixes https://github.com/llvm/llvm-project/issues/139102
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Construct the target triple with `ObjectFile::makeTriple` instead of using
only `Arch` and leaving the rest unknown. Also create the subtarget
with the `CPU`. AMDGPU needs the full triple and `CPU` to disassemble
correctly.
To run a full test, also fixed a failure in `SIPreAllocateWWMRegs` with
the `$noreg` operand in `DBG_VALUE`.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Fixes #135572
Two problems are at play here. First, register types are copied from older
registers instead of being evaluated from the SPIR-V types. Second, the way
OpSelect is defined in SPIRVInstrInfo.td, we always default to integer for
TernOpTyped. There seems to be a problem of multiple matches in the
getMatchTable, so when executeMatchTable runs we aren't getting the right
OpSelect.
Correcting the tablegen wasn't very easy, so instead I created an emitter
for Select that evaluates the register types. This passes the original
llvm/test/CodeGen/SPIRV/instructions/select.ll tests and the new float
ones I'm adding in issue-135572-emit-float-opselect.ll.
`f16` is passed and returned in vector registers on both x86 and AArch64,
the same calling convention as `f32`, so it is a straightforward type to
support. The calling convention support already exists, added as part of
a6065f0fa55a ("Arm64EC entry/exit thunks, consolidated. (#79067)").
Thus, add mangling and remove the error in order to make `half` work.
MSVC does not yet support `_Float16`, so for now this will remain an
LLVM-only extension.
Fixes the `f16` portion of
https://github.com/llvm/llvm-project/issues/94434
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it became clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
The check should be about unsigned 16-bit immediates, not signed ones.
This is not a bug per se, as the old codegen was correct for the
uint16_max case, it just didn't end up using `qc.e.bgeui`, which we
would prefer it did.
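The difference between the two range checks can be sketched as follows. This is a standalone illustration: `fitsSigned16` and `fitsUnsigned16` are hypothetical helpers, not the predicates used in the backend.

```cpp
#include <cstdint>

// True if `v` is representable as a signed 16-bit immediate.
bool fitsSigned16(int64_t v) { return v >= -32768 && v <= 32767; }

// True if `v` is representable as an unsigned 16-bit immediate.
bool fitsUnsigned16(int64_t v) { return v >= 0 && v <= 65535; }
```

With the signed check, 65535 (uint16_max) is rejected even though it is a valid unsigned 16-bit immediate, which matches the missed `qc.e.bgeui` selection described above.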
Since the `dontcall-*` attributes are checked both by
`FastISel`/`GlobalISel` and `SelectionDAGBuilder`, and both `FastISel`
and `GlobalISel` bail on calls on Arm64EC AFTER doing the check, we
ended up emitting duplicate copies of this error.
This change moves the checking for `dontcall-*` in `FastISel` and
`GlobalISel` to after the call has been successfully lowered.
Fixes https://github.com/llvm/llvm-project/issues/149230
Previously, even with simd enabled via `-mattr=+simd128`, the compiler
could not utilize v128 to optimize loads and setcc of i128, instead
legalizing them to consecutive i64s.
This PR adds support for setcc of i128 by converting it to
v16i8's anytrue and alltrue; consequently, this benefits memcmp of 16
bytes or more (when simd128 is present).
The optimization is enabled when the comparison operand is
either a load or an integer of type i128, the comparison code is
`EQ` or `NE`, and the function does not have the `NoImplicitFloat` flag.
Inspiration taken from RISCV's isel lowering.
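The idea of expressing a 128-bit equality test through byte lanes can be modelled in portable C++ as below. This is only a sketch of the arithmetic, assuming that EQ maps to an all-true reduction over lane-wise equality and NE to an any-true reduction over lane-wise inequality; `V16`, `eq128` and `ne128` are hypothetical names and no WebAssembly intrinsics are used.

```cpp
#include <array>
#include <cstdint>

// Stand-in for a v16i8 value holding the 16 bytes of an i128.
using V16 = std::array<uint8_t, 16>;

// Models i128 EQ: all lanes of the lane-wise equality compare are true.
bool eq128(const V16 &a, const V16 &b) {
  bool allTrue = true;
  for (int i = 0; i < 16; ++i)
    allTrue = allTrue && (a[i] == b[i]);
  return allTrue;
}

// Models i128 NE: at least one lane of the lane-wise inequality compare
// is true.
bool ne128(const V16 &a, const V16 &b) {
  bool anyTrue = false;
  for (int i = 0; i < 16; ++i)
    anyTrue = anyTrue || (a[i] != b[i]);
  return anyTrue;
}
```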
Fixes #151764
This fix has two parts. First, we track all lifetime intrinsics, and if
they are users of an alloca of a target extension type like dx.RawBuffer,
we eliminate those memory intrinsics when we visit the alloca.
We do step one to allow us to use the Dead Store Elimination pass. This
removes the alloca and simplifies the use of the target extension back
to using just the global. That keeps things in a form the
DXILBitcodeWriter is expecting.
Obviously, to pull this off we needed to bring back the legacy pass
manager plumbing for the DSE pass and hook it up into the DirectX
backend.
The net impact of this change is that DML shader pass rate went from
89.72% (4268 successful compilations) to 90.98% (4328 successful
compilations).
Not quite NFC as it looks like the original intrinsic-handling code
never got updated to use records. This was never caught because that
code wasn't tested. I've adjusted an existing test so the behaviour is
now covered.
When an undef/poison value is lowered as an immediate, it becomes -1.
When reaching the backend, the -1 was printed as operand to
OpVectorShuffle instead of the proper 0xFFFFFFFF.
From the SPIR-V spec:
A Component literal may also be FFFFFFFF, which means the
corresponding result component has no source and is undefined.
The reason the existing tests were passing `spirv-val` was because the
binary format was used as output, meaning the `-1` was lowered to
`0xFFFFFFFF`. But when the text format is used, `-1` is emitted as-is
which is wrong.
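The mismatch between the two output formats comes down to how the signed immediate is reinterpreted. A minimal sketch follows; `asComponentLiteral` and `printComponentLiteral` are hypothetical helpers, and printing in decimal is an assumption, since the printer's actual formatting is not shown here.

```cpp
#include <cstdint>
#include <string>

// Reinterpret a signed immediate as the unsigned 32-bit literal word that
// the binary encoding stores.
uint32_t asComponentLiteral(int64_t imm) {
  return static_cast<uint32_t>(imm);
}

// Hypothetical text printer: prints the unsigned word rather than the
// signed immediate, so -1 never appears in the output.
std::string printComponentLiteral(int64_t imm) {
  return std::to_string(asComponentLiteral(imm));
}
```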
Fixes #151691
M68k's SETCC instruction (`scc`) distinctly fills the destination byte
with all 1s. If boolean contents are set to `ZeroOrOneBooleanContent`,
LLVM can mistakenly think the destination holds `0x01` instead of `0xff`
and emit broken code as a result. This change corrects the boolean
content type to `ZeroOrNegativeOneBooleanContent`.
For example, this IR:
```llvm
define dso_local signext range(i8 0, 2) i8 @testBool(i32 noundef %a) local_unnamed_addr #0 {
entry:
%cmp = icmp eq i32 %a, 4660
%. = zext i1 %cmp to i8
ret i8 %.
}
```
would previously build as:
```asm
testBool: ; @testBool
cmpi.l #4660, (4,%sp)
seq %d0
and.l #255, %d0
rts
```
Notice the `zext` is erroneously not clearing the low bits, and thus the
register returns with 255 instead of 1. This patch fixes the issue:
```asm
testBool: ; @testBool
cmpi.l #4660, (4,%sp)
seq %d0
and.l #1, %d0
rts
```
Most of the tests containing `scc` suffered from the same value error as
described above, so those tests have been updated to match the new
output (which also logically corrects them).
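The effect of the two boolean-content settings on the example above can be modelled in a few lines. This is a toy model of the arithmetic only; `sccByte`, `zextAssumingZeroOrOne` and `zextAssumingZeroOrNegOne` are hypothetical names.

```cpp
#include <cstdint>

// Models M68k `scc`: the destination byte is all ones when the condition
// holds, and zero otherwise.
uint8_t sccByte(bool cond) { return cond ? 0xFF : 0x00; }

// If booleans are assumed to be 0 or 1, masking the byte with 0xFF looks
// sufficient for a zero-extension, but it leaves 0xFF in place.
uint8_t zextAssumingZeroOrOne(uint8_t b) { return b & 0xFF; }

// If booleans are known to be 0 or -1, only bit 0 may be kept.
uint8_t zextAssumingZeroOrNegOne(uint8_t b) { return b & 0x01; }
```

The first variant yields 255 for a true condition, as in the old assembly; the second yields 1, as in the fixed output.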
This PR adds support for the following instructions to the RISC-V
VLOptimizer: vandn.vx, vandn.vv, vbrev.v, vclz.v, vcpop.v, vctz.v,
vror.vi, vror.vx, vror.vv, vrol.vx, vrol.vv.
Partially fixes #149023.
The original code `MRI.def_begin(Reg)->getParent()` may return the
incorrect MI, as the physical register `Reg` may have multiple
definitions.
This patch selects the correct MI to verify by comparing the MBB of each
definition.
New testcase hangs with -O1/2/3 enabled. The BranchFolding may be to
blame.
[ [CodeGen] Store call frame size in
MachineBasicBlock](https://reviews.llvm.org/D156113) mentions that when a
basic block has been split in the middle of a call sequence, the call
frame size may not be zero, so it needs to be set on the new
MachineBasicBlock with setCallFrameSize. However, the function
`splitMBB(BlockSplitInfo &BSI)` in
llvm/lib/Target/PowerPC/PPCReduceCRLogicals.cpp does not call
setCallFrameSize for the new MachineBasicBlock `NewMBB`. This patch adds
that call.
This patch fixes the crash mentioned in
https://github.com/llvm/llvm-project/pull/144594#issuecomment-2993736654
This PR depends on #148165; the first commit
(90f1d0a881a21a8b4f192622d798c290770fda63) belongs to that PR. The
changes are distinct, so separate PRs seemed like the best option. I
don't have commit access, so I couldn't use user-branches to mark the
dependency.
AMDGPULateCodeGenPrepare actually performs changes that invalidate
Uniformity Analysis, so use `setPreservesCFG()` to mark this, instead of
`setPreservesAll()`, which wrongly includes preserving Uniformity
Analysis.
Note that before #148165, this would still have preserved Uniformity
Analysis, hence the dependency. In addition, `amdgpu/llc-pipeline.ll`
needs to be changed when both changes are in effect, but those changes
would make the test fail if the PRs weren't based on one another.
Note on why this hasn't caused issues so far:
It just so happens that AMDGPULateCodeGenPrepare is always immediately
followed by AMDGPUUnifyDivergentExitNodes, which *does* invalidate most
analyses, including Uniformity. And because UnifyDivergentExitNodes only
looks at terminators, and LateCGP seemingly does not replace uniform
values with divergent values, or divergent values with uniform values,
and it only *inserts new values that are not looked at by
UnifyDivergentExitNodes*, this bug remained hidden.
---
I ran `git-clang-format` on my changes. I tested them using the
`check-llvm` target; no unexpected failures occurred after I made the
change to `amdgpu/llc-pipeline.ll`.
For loops such as this:
```
struct foo {
double a, b;
};
void foo(struct foo *dst, struct foo *src, int n) {
for (int i = 0; i < n; i++) {
dst[i].a += src[i].a * 3.2;
dst[i].b += src[i].b * 3.2;
}
}
```
the complex deinterleaving pass will spot that the deinterleaving
associated with the structured loads cancels out the interleaving
associated with the structured stores. This happens even though
they are not truly "complex" numbers because the pass can handle
symmetric operations too. This is great because it means we can
then perform normal loads and stores instead. However, we can also
do the same for higher interleave factors, e.g. 4:
```
struct foo {
double a, b, c, d;
};
void foo(struct foo *dst, struct foo *src, int n) {
for (int i = 0; i < n; i++) {
dst[i].a += src[i].a * 3.2;
dst[i].b += src[i].b * 3.2;
dst[i].c += src[i].c * 3.2;
dst[i].d += src[i].d * 3.2;
}
}
```
This PR extends the pass to effectively treat such structures as
a set of complex numbers, i.e.
```
struct foo_alt {
std::complex<double> x, y;
};
```
with equivalence between members:
```
foo_alt.x.real == foo.a
foo_alt.x.imag == foo.b
foo_alt.y.real == foo.c
foo_alt.y.imag == foo.d
```
I've written the code to handle sets with arbitrary numbers of
complex values, but since we only support interleave factors
between 2 and 4 I've restricted the sets to 1 or 2 complex
numbers. Also, for now I've restricted support for interleave
factors of 4 to purely symmetric operations only. However, it
could also be extended to handle complex multiplications,
reductions, etc.
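The claimed equivalence between the four-member struct and a pair of complex numbers can be checked on a single element with a small standalone sketch. `Foo`, `FooAlt`, `updateScalar`, `updateComplex` and `toAlt` are illustrative names; this models the arithmetic only, not what the pass does to the IR.

```cpp
#include <complex>

struct Foo {
  double a, b, c, d;
};

struct FooAlt {
  std::complex<double> x, y;
};

// Scalar form of the loop body for interleave factor 4.
void updateScalar(Foo &dst, const Foo &src) {
  dst.a += src.a * 3.2;
  dst.b += src.b * 3.2;
  dst.c += src.c * 3.2;
  dst.d += src.d * 3.2;
}

// The same symmetric operation expressed on two complex values: scaling by
// a real number and adding both act on real and imaginary parts alike.
void updateComplex(FooAlt &dst, const FooAlt &src) {
  dst.x += src.x * 3.2;
  dst.y += src.y * 3.2;
}

// Member mapping: x = (a, b), y = (c, d).
FooAlt toAlt(const Foo &f) { return {{f.a, f.b}, {f.c, f.d}}; }
```

Because the operation is symmetric, no lane ever needs data from its neighbour, which is why the deinterleave/interleave pair can cancel.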
Fixes: https://github.com/llvm/llvm-project/issues/144795
We only do conditional streaming mode changes in two cases:
- Around calls in streaming-compatible functions that don't have a
streaming body
- At the entry/exit of streaming-compatible functions with a streaming
body
In both cases, the condition depends on the entry pstate.sm value. Given
this, we don't need to emit calls to __arm_sme_state at every mode
change.
This patch handles this by placing an "AArch64ISD::ENTRY_PSTATE_SM" node
in the entry block and copying the result to a register. The register is
then used whenever we need to emit a conditional streaming mode change.
The "ENTRY_PSTATE_SM" node expands to a call to "__arm_sme_state" only
if (after SelectionDAG) the function is determined to have
streaming-mode changes.
This has two main advantages:
1. It allows back-to-back conditional smstart/stop pairs to be folded
2. It has the correct behaviour for EH landing pads
- These are entered with pstate.sm = 0, and should switch mode based on
the entry pstate.sm
- Note: This is not fully implemented yet
For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, ISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.
This patch only selects flat instructions with immediate offsets from address
computations with the inbounds flag: If the address computation does not leave
the bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.
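The hazard can be illustrated with a toy model. Everything here is an assumption made for illustration: an aperture is modelled as a half-open address range, and `Aperture`, `contains`, `hwSelectsAperture`, `programExpectsAperture` and `foldIsSafe` are hypothetical names rather than anything in the backend.

```cpp
#include <cstdint>

// Hypothetical half-open address range standing in for a memory aperture.
struct Aperture {
  uint64_t begin, end;
};

bool contains(const Aperture &ap, uint64_t addr) {
  return addr >= ap.begin && addr < ap.end;
}

// Model of the behaviour described above: the aperture test looks at the
// base register only and ignores the immediate offset.
bool hwSelectsAperture(const Aperture &ap, uint64_t base, int64_t /*imm*/) {
  return contains(ap, base);
}

// What the program means: the effective address decides the aperture.
bool programExpectsAperture(const Aperture &ap, uint64_t base, int64_t imm) {
  return contains(ap, base + static_cast<uint64_t>(imm));
}

// Folding the offset into the instruction is only safe when both views
// agree on the aperture.
bool foldIsSafe(const Aperture &ap, uint64_t base, int64_t imm) {
  return hwSelectsAperture(ap, base, imm) ==
         programExpectsAperture(ap, base, imm);
}
```

If the base and the effective address lie in the same allocated object, and that object lies within one aperture, the two views agree, which is the reasoning behind requiring the inbounds flag.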
Relevant tests are in fold-gep-offset.ll.
Analogous to #132353 for SDAG (which is not yet in a mergeable state; its
progress is currently blocked by #146076).
Fixes SWDEV-516125 for GISel.