Across-lane intrinsics with integer destination type (uaddv, saddv,
umaxv, smavx, uminv, sminv) were legalized with the destination type
given in the LLVM IR intrinsic’s definition. It was wider than
the actual destination type of the corresponding machine instruction.
InstructionSelect was implicitly supposed to generate underlying
extension instructions for these intrinsics, while the real destination
type was opaque for other GlobalISel passes. Thus,
llvm/test/CodeGen/AArch64/arm64-vaddv.ll failed on GlobalISel since
the generated code was worse in functions that used the value of
an across-lane intrinsic in following FP&SIMD instructions (functions
with _used_by_laneop suffix).
Here intrinsics are legalized and selected with an actual destination
type, making it transparent to other passes. If the destination
value is used in further instructions accepting FPR registers, there
won’t be extra copies across register banks. i16 type is added to
the list of the types of the FPR16 register bank to make it possible,
and a few SelectionDAG patterns are modified to eliminate ambiguity
in TableGen.
Differential Revision: https://reviews.llvm.org/D156831
This extends the lowering of ceil, floor, nearbyint, rint, round, roundeven and
trunc. They are all very similar, so can reuse the same legalization info.
selectIntrinsicTrunc and selectIntrinsicRound can be removed as they can be
selected via tablegen patterns, and G_INTRINSIC_ROUNDEVEN is marked as a gisel
equivalent of froundeven. Otherwise this reuses the existing code, filling it
out to handle more types.
Differential Revision: https://reviews.llvm.org/D157679
The subtarget was unconditionally reporting that SVE was to be used to
lower vectors when Neon was unavailable, even when SVE itself was
unavailable. This decision leads other parts of the compiler to crash,
e.g., when querying SVE vector sizes.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D158179
This will be useful to avoid the bit-casting noise
required to extend support for Floating Point
Operations in atomic optimizer for DPP in D156301
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D156647
The wwm register spill pseudos are currently defined for VGPR_32
regclass. It causes a verifier error for gfx908 or above as the
regalloc sometimes restores the values to the vector superclass AV_32.
Fixing it by supporting AV wwm-spill pseudos as well.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D155646
Keep components of UNMERGE larger after running the Artifact Combiner on it.
This was intended to help with <v16i64> = G_SEXT <v16i16>, but implementation
for legalizing EXT is in a following patch, therefore a test for this case
will be included in the following patch
Differential Revision: https://reviews.llvm.org/D157715
Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.
Helps some regressions from D152928 and #63946Fixes#63710
When building a fatbinary, the driver invokes the compiler multiple
times with different "--target". (For example, with "-x cuda
--cuda-gpu-arch=sm_70" flags, clang will be invoded twice, once with
--target=x86_64_...., once with --target=sm_70) If we use
-fsplit-machine-functions or -fno-split-machine-functions for such
invocation, the driver reports an error.
This CL changes the behavior so:
- "-fsplit-machine-functions" is now passed to all targets, for non-X86
targets, the flag is a NOOP and causes a warning.
- "-fno-split-machine-functions" now negates -fsplit-machine-functions (if
-fno-split-machine-functions appears after any -fsplit-machine-functions)
for any target triple, previously, it causes an error.
- "-fsplit-machine-functions -Xarch_device -fno-split-machine-functions"
enables MFS on host but disables MFS for GPUS without warnings/errors.
- "-Xarch_host -fsplit-machine-functions" enables MFS on host but disables
MFS for GPUS without warnings/errors.
Reviewed by: xur, dhoekwater
Differential Revision: https://reviews.llvm.org/D157750
D52985/D57677 added a .gcc_except_table workaround, but the new behavior
doesn't match GNU assembler.
```
void foo();
int bar() {
foo();
try { throw 1; }
catch (int) { return 1; }
return 0;
}
clang --target=mipsel-linux-gnu -mmicromips -S a.cc
mipsel-linux-gnu-gcc -mmicromips -c a.s -o gnu.o
.uleb128 ($cst_end0)-($cst_begin0) // bit 0 is not forced to 1
.uleb128 ($func_begin0)-($func_begin0) // bit 0 is not forced to 1
```
I have inspected `.gcc_except_table` output by `mipsel-linux-gnu-gcc -mmicromips -c a.cc`.
The `.uleb128` values are not forced to set the least significant bit.
In addition, D57677's adjustment (even->odd) to CodeGen/Mips/micromips-b-range.ll is wrong.
PC-relative `.long func - .` values will differ from GNU assembler as well.
The original intention of D52985 seems unclear to me. I think whatever
goal it wants to achieve should be moved to an upper layer.
This isMicroMips special case has caused problems to fix MCAssembler::relaxLEB to use evaluateAsAbsolute instead of evaluateKnownAbsolute,
which is needed to proper support R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128.
Differential Revision: https://reviews.llvm.org/D157655
When emitting assembly we don't particularly want the binary DXIL
embedded in the output. This was mostly there for testing purposes, so
we update those tests to run the test directly using `opt` and
restrict the -dxil-embed and -dxil-globals passes to running normally
only in the case where we're trying to emit a DXContainer.
Differential Revision: https://reviews.llvm.org/D158051
Machine function splitting will become available for AArch64; since MFS
is no longer X86-only, the tests for generic behavior should live
somewhere other than tests/CodeGen/X86.
MFS implementation doesn't vary much across platforms, and most tests
should be identical between X86 and AArch64 besides instruction
selection, so the tests can live together in tests/CodeGen/Generic.
Differential Revision: https://reviews.llvm.org/D157563
Aligning functions yields small performance gains on
embedded cores, moreso with numerous small function calls.
Similar to aligning loops, if the function can fit within
a single cache line then the performance overhead of
fetching more instructions can be limited.
Differential Revision: https://reviews.llvm.org/D157514
This allows us to select G_SHUFFLE_VECTOR with identity masks (possibly
including undef elements), but avoid the actual EXT instruction if the shift
amount is 0.
Generate store immediate instruction when CPUv4 is enabled.
For example:
$ cat test.c
struct foo {
unsigned char b;
unsigned short h;
unsigned int w;
unsigned long d;
};
void bar(volatile struct foo *p) {
p->b = 1;
p->h = 2;
p->w = 3;
p->d = 4;
}
$ clang -O2 --target=bpf -mcpu=v4 test.c -c -o - | llvm-objdump -d -
...
0000000000000000 <bar>:
0: 72 01 00 00 01 00 00 00 *(u8 *)(r1 + 0x0) = 0x1
1: 6a 01 02 00 02 00 00 00 *(u16 *)(r1 + 0x2) = 0x2
2: 62 01 04 00 03 00 00 00 *(u32 *)(r1 + 0x4) = 0x3
3: 7a 01 08 00 04 00 00 00 *(u64 *)(r1 + 0x8) = 0x4
4: 95 00 00 00 00 00 00 00 exit
Take special care to:
- apply `BPFMISimplifyPatchable::checkADDrr` rewrite for BPF_ST
- validate immediate value when BPF_ST write is 64-bit:
BPF interprets `(BPF_ST | BPF_MEM | BPF_DW)` writes as writes with
sign extension. Thus it is fine to generate such write when
immediate is -1, but it is incorrect to generate such write when
immediate is +0xffff_ffff.
This commit was previously reverted in e66affa17e32.
The reason for revert was an unrelated bug in BPF backend,
triggered by test case added in this commit if LLVM is built
with LLVM_ENABLE_EXPENSIVE_CHECKS.
The bug was fixed in D157806.
Differential Revision: https://reviews.llvm.org/D140804
vmv.x.s and vmv.f.s are unconditional. They read the low element of a vector
register (not vector group), and function even when VL=0 or VSTART>0. As such,
they are don't care with respect to both VL and LMUL.
We'd previously had handling in the forward pass only via the NoRegister
mechanusm. (The only instructions with SEW but without VL are these extracts.)
This patch moves that handling into getDemanded so that the backwards pass
benefits as well.
Differential Revision: https://reviews.llvm.org/D157991
We were defaulting to VL=0 when we didn't otherwise have a vsetv
nearby. Instead, let's use VL=1. VL=0 is very much a cornercase
in hardware, and let's avoid if we can.
Differential Revision: https://reviews.llvm.org/D158015
Apparently the spec has overloads for fmin/fmax and ldexp with one of
the operands as scalar. We need to broadcast the scalars to the vector
type.
https://reviews.llvm.org/D158077
Fix verification that a PhysReg is live in to an MBB.
isLiveIn does not handle reg units, so cannot identify when a
register would be defined because its super register is partially
defined.
Additionally a PhysReg may be partial defined at block entry and
then fully defined before any use.
Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D157086
As far as I understand - When lowering a G_CONSTANT_FOLD_BARRIER we replace the
DstReg with SrcReg, and need to check that the register class is equivalent
when doing so for the replacement to be legal. During lowering we could end up
visiting nodes in an odd order, leaving a G_CONSTANT_FOLD_BARRIER with a known
regclass for the src, but only a regbank for the dst. Providing the Regbank
contains the regclass, the replacement should still be safe.
This fixes an assert seen in the llvm-test-suite when lowering hoisted
constants, relaxing canReplaceReg to account for the case when the regbank
covers the regclass, so it is better able to handle differences in visiting
order.
Differential Revision: https://reviews.llvm.org/D157202
On X86 for vec types `(X & Y) == Y` is generally preferable to
`(X & Y) != 0`. Creating zero requires an extra instruction and on
pre-avx512 targets there is no vector `pcmpne` so it requires two
additional instructions to invert the `pcmpeq`.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D157014
1) Handle casts a bit more cleanly just with a loop rather than with
recursion.
2) Add additional cases for smin/smax
3 ) For shifts we can also deduce non-zero if the maximum shift amount
on the known 1s is non-zero.
Differential Revision: https://reviews.llvm.org/D156777
This was added originally as the test was failing on NVPTX before an
explicit target triple was set on the llc invocation. The test was fixed
in 4afb1ee7bc0e3674238da2d3668f8d8b80596c62 but the unsupported
directive was never removed.
Because the code layout is not known during compilation, the distance of
cross-section jumps is not knowable at compile-time. Because of this, we
should assume that any cross-sectional jumps are out of range. This
assumption is necessary for machine function splitting on AArch64, which
introduces cross-section branches in the middle of functions. The linker
relaxes out-of-range unconditional branches, but it clobbers X16 to do
so; it doesn't relax conditional branches, which must be manually
relaxed by the compiler.
Differential Revision: https://reviews.llvm.org/D145211
Machine function splitting will become available for AArch64; since MFS
is no longer X86-only, the tests for generic behavior should live
somewhere other than tests/CodeGen/X86.
MFS implementation doesn't vary much across platforms, and most tests
should be identical between X86 and AArch64 besides instruction
selection, so the tests can live together in tests/CodeGen/Generic.
Differential Revision: https://reviews.llvm.org/D157563
Move the sra(shl(x,c1),c1) -> sign_extend_inreg(x) fold inside SimplifyDemandedBits so we can recognize hidden splats with DemandedElts masks.
Because the c1 shift amount has multiple uses, hidden splats won't get simplified to a splat constant buildvector - meaning the existing fold in DAGCombiner::visitSRA can't fire as it won't see a uniform shift amount.
I also needed to add TLI preferSextInRegOfTruncate hook to help keep truncate(sign_extend_inreg(x)) vector patterns on X86 so we can use PACKSS more efficiently.
Differential Revision: https://reviews.llvm.org/D157972
These cmp patterns were using the pre-GFX11 pseudo instruction, and so
failed to compile.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D157912
Mirror of the previous log changes, OpenCL conformance doesn't like
interpreting afn as ignore denormal handling but was previously hidden
by flag dropping.
Apparently afn doesn't allow you to drop the denormal handling
according to OpenCL conformance. This was hidden by losing the flags
during the library linking process. Fast log is still broken and needs
more work.
https://reviews.llvm.org/D157936