V implies Zvl128b, but a lot of the fixed vector tests also redundantly
specify -riscv-v-vector-bits-min=128. This patch removes them where
there isn't another minimum vlen being tested for, and for cases where
Zve* is being used Zvl128b was added to maintain the old test diff (and
because an awkward vlen probably isn't interesting to test for). Other
places where -risc-v-vector-bits-min were being used were replaced with
Zvl.
Assuming the ADD is nsw then it may be sign-extended to merge with a SHL op in a similar fold to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.
This is most useful for helping to expose address math for X86, but has also touched several aarch64 test cases as well.
Alive2: https://alive2.llvm.org/ce/z/2UpSbJ
Differential Revision: https://reviews.llvm.org/D159198
In D158086, we limit all floating point scalar move and splat can't fuse
vsetvli with different SEW, and this patch try to relax the constraint
as possible by introducing new SEW demand type:
SEWGreaterThanOrEqualAndLessThan64, that allow SEW fused with larger
SEW, but constraint it can't fused with SEW=64.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D158177
addi sp, sp, 512 may be used to recover the sp in the epilogue
when stack size is larger than 2047(2^11 - 1), however, it can
not be compressed using C extension, and addi sp, sp, 496 is
able to be compressed, so try to use 496 as the ajust amount of
the fisrt sp if function doesn't need extra instructions after
adjust.
Reviewed By: wangpc
Differential Revision: https://reviews.llvm.org/D159431
On PPC there are instructions to store element from vector(e.g.
stxsdx/stxsiwx), and these instructions can be leveraged to avoid tail
constant in memset and constant splat array initialization.
This patch tries to explore these opportunities.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D138883
Compiling with TLS variables requires -pthread, but if the user omits this
option, the compiler will not show any obvious indication during compilation
that -pthread is needed for programs using TLS variables. Instead, the user will
experience a segmentation fault when running programs with TLS variables in them
and without specifying -pthread.
This patch aims to generate .extern/.ref references to __tls_get_addr[DS] for
local-exec accesses, in order to trigger an error from the linker to indicate
that there is an undefined symbol to __tls_get_addr. Doing so will remind the
user to compile/link with -pthread.
Differential Revision: https://reviews.llvm.org/D151335
Each vslide1down operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do a VL unique inserts, each with a cost linear in LMUL, the overall cost is O(VL*LMUL). Since VL is a linear function of LMUL, this means the current lowering is quadradic in both LMUL and VL. To avoid the degenerate case, fallback to the stack if the cost is more than a fixed (linear) threshold.
For context, here's the sifive-x280 llvm-mca results for the current lowering and stack based lowering for each LMUL (using e64). Assumes code was compiled for V (i.e. zvl128b).
buildvector_m1_via_stack.mca:Total Cycles: 1904
buildvector_m2_via_stack.mca:Total Cycles: 2104
buildvector_m4_via_stack.mca:Total Cycles: 2504
buildvector_m8_via_stack.mca:Total Cycles: 3304
buildvector_m1_via_vslide1down.mca:Total Cycles: 804
buildvector_m2_via_vslide1down.mca:Total Cycles: 1604
buildvector_m4_via_vslide1down.mca:Total Cycles: 6400
buildvector_m8_via_vslide1down.mca:Total Cycles: 25599
There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, stack based seems to cost better on all LMULs, so we can just go with the simpler scheme.
Arguably, this patch is fixing a regression introduced with my D149667 as before that change, we'd always fallback to the stack, and thus didn't have the non-linearity.
Differential Revision: https://reviews.llvm.org/D159332
When having <N x t> d1, unused = unmerge(G_EXT <2*N x t> v1, undef, N),
it is possible to express it just as unused, d1 = unmerge v1.
It is useful for tackling regressions in arm64-vcvt_f.ll, introduced in
https://reviews.llvm.org/D144670.
When having <N x t> d1, unused = unmerge(G_EXT <2*N x t> v1, undef, N),
it is possible to express it just as unused, d1 = unmerge v1.
It is useful for tackling regressions in arm64-vcvt_f.ll, introduced in
https://reviews.llvm.org/D144670.
If there are copy instructions between uaddlv and dup for transfer from gpr to
fpr, try to remove them with duplane.
Differential Revision: https://reviews.llvm.org/D159267
In some cases where the same mask is used for multiple
extending masked loads it can be more efficient to combine
the zero- or sign-extend into the load even if it's not a
legal or custom operation. This leads to splitting up the
extending load into smaller parts, which also requires
splitting the mask. For SVE at least this improves the
performance of the SPEC benchmark x264 slightly on
neoverse-v1 (~0.3%), and at least one other benchmark
improves by around 30%. The uplift for SVE seems due to
removing the dependencies (vector unpacks) introduced
between the loads and the vector operations, since this
should increase the level of parallelism.
See tests:
CodeGen/AArch64/sve-masked-ldst-sext.ll
CodeGen/AArch64/sve-masked-ldst-zext.ll
https://reviews.llvm.org/D159191
I've added some missing tests for the following cases:
1. Zero- and sign-extends from unpacked vector types to wide,
illegal types. For example,
%aext = zext <vscale x 4 x i8> %a to <vscale x 4 x i64>
2. Normal loads combined with 1
3. Masked loads combined with 1
Differential Revision: https://reviews.llvm.org/D159192
Adds a new feature to MIR patterns: builtin instructions.
They offer some additional capabilities that currently cannot be expressed without falling back to C++ code.
There are two builtins added with this patch, but more can be added later as new needs arise:
- GIReplaceReg
- GIEraseRoot
Depends on D158714, D158713
Reviewed By: arsenm, aemerson
Differential Revision: https://reviews.llvm.org/D158975
mffsl is available since ISA 3.0. The builtin is named with ppc prefix
to follow our convention. For targets earlier than power9, GCC generates
extra code to support the functionality, while this patch does not
implement such behavior.
Reviewed By: nemanjai, tuliom
Differential Revision: https://reviews.llvm.org/D158065
When calling a function which requires no streaming-mode change from an
__arm_locally_streaming function, LLVM would otherwise emit:
// function prologue
smstart
b streaming_compatible_function // tail call
// never an smstop
R12 is callee-saved in functions with pacbti-m enabled, but this is
done in assignCalleeSavedSpillSlots, meaning that in
determineCalleeSaves we have to manually set CanEliminateFrame.
This fixes a bug where in leaf functions with no other callee-saved
registers the aut instruction wouldn't be emitted and stack offsets
of arguments passed on the stack would be incorrect.
Differential Revision: https://reviews.llvm.org/D157865
If no frame-pointer is available and the compiler has scavenged a
spill-slot in the callee-save area, the compiler may be forced to emit an
'addvl' inside the streaming-mode-changing call sequence when it needs to
fill (reload) an FP register being passed to the call.
We can avoid this entirely by disabling stack-slot scavenging when there
are streaming-mode-changing call-sequences in the function.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D159196
Now that the codegen for the expanded ISD::ROTL sequence has been improved,
it's probably profitable to lower a shuffle that's a rotate to the
vsll+vsrl+vor sequence to avoid a vrgather where possible, even if we don't
have the vror instruction.
This patch relaxes the restriction on ISD::ROTL being legal in
lowerVECTOR_SHUFFLEAsRotate. It also attempts to do the lowering twice: Once
if zvbb is enabled before any of the interleave/deinterleave/vmerge lowerings,
and a second time unconditionally just before it falls back to the vrgather.
This way it doesn't interfere with any of the above patterns that may be more
profitable than the expanded ISD::ROTL sequence.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D159353
uses when looking for load/store users. This was a simple logic bug during translation
of the equivalent function in SelectionDAG:
```
for (SDNode *Node : N->uses()) {
if (auto *LoadStore = dyn_cast<MemSDNode>(Node)) {
```
The lookthrough int<->ptr cast tests and code were both wrongly checking the wrong
register uses. This change is fixing and precommiting the test to prepare for
the code fix.
We can just bitcast the pre-shifted mask as an integer and use TEST/BT directly.
Reapplied with fix for 239ab16ec121 which didn't set the comparison type correctly
When emit NoOp bitcast for GEP Ptr Operand, should use SourceElementType
instead of ResultElementType.
**Behavior Before Change**
Redundant bitcast like
` bitcast ptr addrspace(3) @gs to ptr addrspace(3)`
will be generated for llvm/test/CodeGen/DirectX/typed_ptr.ll
**Behavior After Change**
No bitcast will be generated.
Fixes https://github.com/llvm/llvm-project/issues/65183
We currently have log, log2, log10, exp and exp2 intrinsics. Add exp10
to fix this asymmetry. AMDGPU already has most of the code for f32
exp10 expansion implemented alongside exp, so the current
implementation is duplicating nearly identical effort between the
compiler and library which is inconvenient.
https://reviews.llvm.org/D157871