The intrinsic is called llvm.ceil not llvm.fceil. The checks weren't
strong enough to notice that a call to llvm.fceil was emitted in
the final assembly.
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.
That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.
Differential Revision: https://reviews.llvm.org/D112138
MachineLoop::isLoopInvariant() returns false for all VALU
because of the exec use. Check TII::isIgnorableUse() to
allow hoisting.
That unfortunately results in higher register consumption
since MachineLICM does not adequately estimate pressure.
Therefor I think it shall only be enabled after D107677 even
though it does not depend on it.
Differential Revision: https://reviews.llvm.org/D107859
D106408 was doing this for all targets although it was
reverted due to couple performance regressions on some targets.
The difference for AMDGPU is the ability to rematerialize SOP
instructions with virtual register uses like we already do for VOP.
Differential Revision: https://reviews.llvm.org/D110743
Add relaxed. f32x4.min, f32x4.max, f64x2.min, f64x2.max. These are only
exposed as builtins, and require user opt-in.
Differential Revision: https://reviews.llvm.org/D112146
Our fallback expansion for CTLZ/CTTZ relies on CTPOP. If CTPOP
isn't legal or custom for a vector type we would scalarize the
CTLZ/CTTZ. This is different than CTPOP itself which would use a
vector expansion.
This patch teaches expandCTLZ/CTTZ to rely on the vector CTPOP
expansion instead of scalarizing. To do this I had to add additional
checks to make sure the operations used by CTPOP expansions are all
supported. Some of the operations were already needed for the CTLZ/CTTZ
expansion.
This is a huge improvement to the RISCV which doesn't have a scalar
ctlz or cttz in the base ISA.
For WebAssembly, I've added Custom lowering to keep the scalarizing
behavior. I've also extended the scalarizing to CTPOP.
Differential Revision: https://reviews.llvm.org/D111919
When inserting a scalable subvector into a scalable vector through
the stack, the index to store to needs to be scaled by vscale.
Before this patch, that didn't yet happen, so it would generate the
wrong offset, thus storing a subvector to the incorrect address
and overwriting the wrong lanes.
For some insert:
nxv8f16 insert_subvector(nxv8f16 %vec, nxv2f16 %subvec, i64 2)
The offset was not scaled by vscale:
orr x8, x8, #0x4
st1h { z0.h }, p0, [sp]
st1h { z1.d }, p1, [x8]
ld1h { z0.h }, p0/z, [sp]
And is changed to:
mov x8, sp
st1h { z0.h }, p0, [sp]
st1h { z1.d }, p1, [x8, #1, mul vl]
ld1h { z0.h }, p0/z, [sp]
Differential Revision: https://reviews.llvm.org/D111633
autiasp, autibsp instructions are the counterpart of paciasp/pacibsp instructions
therefore let's emit .cfi_negate_ra_state for these too.
In case of Armv8.3 instruction set the retaa/retbb will do the return and authentication
in one step here we can't emit the . cfi_negate_ra_state because that would be point after
the ret* instruction.
Reviewed By: nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D111780
This change implements new DAG nodes TABLE_GET/TABLE_SET, and lowering
methods for load and stores of reference types from IR arrays. These
global LLVM IR arrays represent tables at the Wasm level.
Differential Revision: https://reviews.llvm.org/D111154
Add i8x16 relaxed_swizzle instructions. These are only
exposed as builtins, and require user opt-in.
Differential Revision: https://reviews.llvm.org/D112022
Currently, .BTF and .BTF.ext has default alignment of 1.
For example,
$ cat t.c
int foo() { return 0; }
$ clang -target bpf -O2 -c -g t.c
$ llvm-readelf -S t.o
...
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
...
[ 7] .BTF PROGBITS 0000000000000000 000167 00008b 00 0 0 1
[ 8] .BTF.ext PROGBITS 0000000000000000 0001f2 000050 00 0 0 1
But to have no misaligned data access, .BTF and .BTF.ext
actually requires alignment of 4. Misalignment is not an issue
for architecture like x64/arm64 as it can handle it well. But
some architectures like mips may incur a trap if .BTF/.BTF.ext
is not properly aligned.
This patch explicitly forced .BTF and .BTF.ext alignment to be 4.
For the above example, we will have
[ 7] .BTF PROGBITS 0000000000000000 000168 00008b 00 0 0 4
[ 8] .BTF.ext PROGBITS 0000000000000000 0001f4 000050 00 0 0 4
Differential Revision: https://reviews.llvm.org/D112106
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)
This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD
Differential Revision: https://reviews.llvm.org/D112095
Following on from an earlier patch that introduced support for -mtune
for AArch64 backends, this patch splits out the tuning features
from the processor features. This gives us the ability to enable
architectural feature set A for a given processor with "-mcpu=A"
and define the set of tuning features B with "-mtune=B".
It's quite difficult to write a test that proves we select the
right features according to the tuning attribute because most
of these relate to scheduling. I have created a test here:
CodeGen/AArch64/misched-fusion-addr-tune.ll
that demonstrates the different scheduling choices based upon
the tuning.
Differential Revision: https://reviews.llvm.org/D111551
GPR uses argument registers as the first group of registers to allocate.
This patch uses vector argument registers, v8 to v23, as the first group
to allocate.
Differential Revision: https://reviews.llvm.org/D111304
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.
Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>` but disallow anything more than just -O#.
Tests need to be updated to not use `opt -O# -pass-name`.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D112036
By default clang emits complete contructors as alias of base constructors if they are the same.
The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols.
@yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true`
and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values
with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had
to be extended to support aliases to functions. inline-calls.ll was corrected appropriately.
Reviewed By: yaxunl, #amdgpu
Differential Revision: https://reviews.llvm.org/D109707
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.
The lshr also has a better chance of folding with shuffles etc.
The feature tells the backend to allow tags in the upper bits of global
variable addresses. These tags will be ignored by upcoming CPUs with
the Intel LAM feature but may be used in instrumentation passes (e.g.,
HWASan).
This patch implements the feature by using @GOTPCREL relocations instead
of direct references to the locally defined global. Thus the full
tagged address can be loaded by a single instruction:
movq global@GOTPCREL(%rip), %rax
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D111343
Paul Chaignon reported a bpf verifier failure ([1]) due to using
non-ABI register R11. For the test case, llvm11 is okay while
llvm12 and later generates verifier unfriendly code.
The failure is related to variable length array size.
The following mimics the variable length array definition
in the test case:
struct t { char a[20]; };
void foo(void *);
int test() {
const int a = 8;
char tmp[AA + sizeof(struct t) + a];
foo(tmp);
...
}
Paul helped bisect that the following llvm commit is
responsible:
552c6c232872 ("PR44406: Follow behavior of array bound constant
folding in more recent versions of GCC.")
Basically, before the above commit, clang frontend did constant
folding for array size "AA + sizeof(struct t) + a" to be 68,
so used alloca for stack allocation. After the above commit,
clang frontend didn't do constant folding for array size
any more, which results in a VLA and llvm.stacksave/llvm.stackrestore
is generated.
BPF architecture API does not support stack pointer (sp) register.
The LLVM internally used R11 to indicate sp register but it should
not be in the final code. Otherwise, kernel verifier will reject it.
The early patch ([2]) tried to fix the issue in clang frontend.
But the upstream discussion considered frontend fix is really a
hack and the backend should properly undo llvm.stacksave/llvm.stackrestore.
This patch implemented a bpf IR phase to remove these intrinsics
unconditionally. If eventually the alloca can be resolved with
constant size, r11 will not be generated. If alloca cannot be
resolved with constant size, SelectionDag will complain, the same
as without this patch.
[1] https://lore.kernel.org/bpf/20210809151202.GB1012999@Mem/
[2] https://reviews.llvm.org/D107882
Differential Revision: https://reviews.llvm.org/D111897
The MIPS ABI requires the thread pointer be accessed via rdhwr $3, $r29.
This is currently represented by (CopyToReg $3, (RDHWR $29)) followed by
a (CopyFromReg $3). However, there is no glue between these, meaning
scheduling can break those apart. In particular, PR51691 is a report
where PseudoSELECT_I was moved to between the CopyToReg and CopyFromReg,
and since its expansion uses branches, it split the def and use of the
physical register between two basic blocks, resulting in the def being
eliminated and the use having no def. It also seems possible that a
similar situation could arise splitting up the CopyToReg from the RDHWR,
causing the RDHWR to use a destination register other than $3, violating
the ABI requirement.
Thus, add glue between all three nodes to ensure they aren't split up
during instruction selection. No regression test is added since any test
would be implictly relying on specific scheduling behaviour, so whilst
it might be testing that glue is preventing reordering today, changes to
scheduling behaviour could result in the test no longer being able to
catch a regression here, as the reordering might no longer happen for
other unrelated reasons.
Fixes PR51691.
Reviewed By: atanasyan, dim
Differential Revision: https://reviews.llvm.org/D111967
Try to widen element type to get a new mask value for a better permutation
sequence, so that we can use NEON shuffle instructions, such as zip1/2,
UZP1/2, TRN1/2, REV, INS, etc.
For example:
shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 6, i32 7, i32 2, i32 3>
is equivalent to:
shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>
Finally, we can get:
mov v0.d[0], v1.d[1]
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D111619
Add patterns for i8/i16 local atomic load/store.
Added tests for new patterns.
Copied atomic_[store/load]_local.ll to GlobalISel directory.
Differential Revision: https://reviews.llvm.org/D111869
The process of widening simple vector loads attempts to use a load of a
wider vector type if the original load is sufficiently aligned to avoid
memory faults.
However this optimization is only legal when performed on fixed-length
vector types. For scalable vector types this is invalid (unless vscale
happens to be 1).
This patch does increase the likelihood of compiler crashes (from
`FindMemType` failing to find a suitable type) but this now better
matches how widening non-simple loads, insufficiently-aligned loads, and
scalable-vector stores are handled.
Patches will be introduced later by which loads and stores can be
widened on targets with support for masked or predicated operations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111885
The change adds divergence predicates for fused logical operations.
The problem with selecting a scalar fused op such as S_NOR_B32 is
that it does not have a VALU counterpart and will be split in
moveToVALU. At the same time it prevents selection of a better
opcode on the VALU side (such as V_OR3_B32) which does not have a
counterpart on SALU side.
XNOR opcodes are left as is and selected as scalar to get advantage
of the SIInstrInfo::lowerScalarXnor() code which can commute
operations to keep one of two opcodes on SALU if possible. See
xnor.ll test for this.
Differential Revision: https://reviews.llvm.org/D111907
SVE has predicated literal forms of some instructions for specific
literals, which currently are generated correctly when using ACLE
but not when those instructions are generated directly.
This adds the patterns to generate those instructions when
generating from standard LLVM IR instructions.
Differential Revision: https://reviews.llvm.org/D99074
This patch detects the absolute value pattern on the RHS of a
subtract. If we find it we swap the CMOV true/false values and
replace the subtract with an ADD.
There may be a more generic way to do this, but I'm not sure.
Targets that don't have legal or custom ISD::ABS use a generic
expand in DAG combiner already when it sees (neg (abs(x))). I
haven't checked what happens if the neg is a more general subtract.
Fixes PR50991 for X86.
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D111858
By default clang emits complete contructors as alias of base constructors if they are the same.
The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols.
@yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true`
and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values
with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had
to be extended to support aliases to functions. inline-calls.ll was corrected appropriately.
Reviewed By: yaxunl, #amdgpu
Differential Revision: https://reviews.llvm.org/D109707
- When a redundant MBB is being erased from MDT, check whether its
single successor is dominiated by it. If yes, update that successor's
idom before erasing MBB; otherwise, it implies MBB is a leaf node and
could be erased directly.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D111831
This patch remove the override in AIX target,
so the int128 is enabled in 64 bit mode or with ForceEnableInt128.
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D111078
This changes fixes a case in which the highest set bit of the original
result is at bit 31 and sign-extending the mul24 for it would make the
result negative.
Differential Revision: https://reviews.llvm.org/D111823