When checking to see if our index expressions can be converted into strided
operations, we previously gave up if the index type wasn't an exact match for
the intptrty for the address. Per gep semantics, this mismatch implies a sext
or trunc cast to the respective index type. For constants, go ahead and
evaluate that cast instead of giving up.
Note that the motivation of this is mostly test cleanup. We canonicalize at IR
such that the gep index will match the intptrty. This is mostly useful so that
we can write both RV32 and RV64 tests from the same source. Its also helpful in
preventing confusion - I've stumbled across this at least four times now and
wasted time each one.
Note: The test change for scatters unit stride cases contains a minor
regression for rv32 and 64 bit indices. This is an artifact of order in which
changes are landing. This will be addressed in a near future change for all
configurations.
The Z#_HI register definitions were created during the very early SVE
enablement work and before the SVE calling convention was locked in.
As they look entirely unused, they need to go.
If the SRL for Hi constant folds, but we don't remoe those bits from
the Lo, we can end up with strange constant folding through DAGCombine later.
I've only seen this with constants being lowered to constant pools
during lowering on RISC-V.
On the first split we create two i32 trunc stores and a srl to shift
the high part down. The srl gets constant folded, but to produce
a new i32 constant. But the truncstore for the low store still uses
the original constant.
This original constant then gets converted to a constant pool
before we revisit the stores to further split them. The constant
pool prevents further constant folding of the additional srls.
After legalization is done, we run DAGCombiner and get some constant
folding of srl via computeKnownBits which can peek through the constant
pool load. This can create new constants that also need a constant pool.
The intrinsic uses ImmArg so TargetConstant would be consistent
with how other intrinsics are handled.
This hides the constants from type legalization so we can remove
the promotion support.
isel patterns are updated accordingly.
This reverts commit 47324cfd7d8ca1a2a5cbb9f948ecff66a28ee6bc.
This exposed incorrect debuginfo in rustc. Revert the verification
until this has been fixed.
Update LiveIntervals after rewriting:
%reg = INSERT_SUBREG undef %reg, %subreg, subidx
to:
undef %reg:subidx = COPY %subreg
D113044 implemented this for the non-undef case.
Following on from D139466 which added support for dead vreg defs, this
patch adds support for "undef" defs of subregs.
Use this to regenerate checks for amx-greedy-ra-spill-shape.ll which
previously required manual tweaks to the autogenerated checks to fix an
EXPENSIVE_CHECKS failure; see commit
8b7c1fbd9647a5a6ef246a6b5b2543ea0f5a2337
It maybe emit the .option directive without any follow up. Only emit the .option push/pop when there are supported extension difference between function and module.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D159399
We can get `BlockAddress` in user code via `Labels as Values` so
we should be able to merge the access to `BlockAddress`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D159429
For RVV, If we want to perform an i8 or i16 element-wise vector
arithmetic right shift in the upper C/C++ program, the value to be
shifted would be first sign extended to i32, and the shift amount would
also be zero_extended to i32 to perform the vsra.vv instruction, and
followed by a truncate to get the final calculation result, such pattern
will later expanded to a series of "vsetvli" and "vnsrl" instructions
later, this is because the RVV spec only support 2 * SEW -> SEW
truncate. But for vector, the shift amount can also be determined by
smin (Y, ScalarSizeInBits(Y) - 1)). Also, for the vsra instruction, we
only care about the low lg2(SEW) bits as the shift amount.
- Alive2: https://alive2.llvm.org/ce/z/u3-Zdr
- C++ Test cases : https://gcc.godbolt.org/z/q1qE7fbha
The non-constant bit/bif/bsp already work through tablegen patterns, this
patch handles the constant case, mirroring the basic support for
`or(and(X, C), and(Y, ~C))` from ISel tryCombineToBSL. BSP gets expanded
to either BIT, BIF or BSL depending on the best register allocation.
G_BIT can be replaced with G_BSP as a more general alternative.
This patch adjusts the legality check for riscv to use `cpop/cpopw` since `isOperationLegal(ISD::CTPOP, MVT::i32)` returns false on rv64gc_zbb.
Clang vs gcc: https://godbolt.org/z/rc3s4hjPh
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D156390
If the data type is larger than e16, and the requires more than LMUL1 register
class, prefer the use of vrgatherei16. This has three major benefits:
1) Less work needed to evaluate the constant for e.g. vid sequences. Remember
that arithmetic generally scales lineary with LMUL.
2) Less register pressure. In particular, the source and indices registers
*can* overlap so using a smaller index can significantly help at m8.
3) Smaller constants. We've got a bunch of tricks for materializing small
constants, and if needed, can use a EEW=16 load.
If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation. For
RISCV, this replaces a indexed load/store with a unit strided memory operation
and a vrgather (at worst).
I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX. Given that, they
should have been transformed to the non-vp variants anyways. I haven't checked
to see if they actually are.
Some instruction selection patterns required for ALU GPR instructions have
already been automatically imported from existing TableGen descriptions -
this patch simply adds testing for them. The first of the GIComplexPatternEquiv
definitions required to select the shiftMaskXLen ComplexPattern has been added.
Some instructions require special handling due to i32 not being a legal
type on RV64 in SelectionDAG so we can't reuse SelectionDAG patterns.
Co-authored-by: Lewis Revill <lewis.revill@embecosm.com>
Reviewed By: nitinjohnraj
Differential Revision: https://reviews.llvm.org/D76445
In D154687, we added a transform to narrow indexed load/store indices of the
form (shl (zext), C). We can move this into a generic transform over the
target independent nodes instead, and pick up the fixed vector cases with no
additional work required. This is an alternative to D158163.
Performing this transform points out that we weren't eliminating zero_extends
via the the generic DAG combine. Adjust the (existing) callbacks so that we
do.
This change *removes* the existing transform on the target specific intrinsic
nodes. If anyone has a use case this impacts, please speak up.
Note: Reviewed as part of a stack of changes in PR# 66405.
If a virtual register is not assigned preferred physical register, it means some
COPY instructions will be changed to real register move instructions. In this
case we can try to split the virtual register in colder blocks, if success, the
original COPY instructions can be deleted, and the new COPY instructions in
colder blocks will be generated as register move instructions. It results in
fewer dynamic register move instructions executed.
The new test case split-reg-with-hint.ll gives an example, the hot path contains
24 instructions without this patch, now it is only 4 instructions with this
patch.
Differential Revision: https://reviews.llvm.org/D156491
We have the analogous case in the single insert path. The reasoning here is that if the original VL fits in LMUL1, we'd prefer to clobber a few extra dead lanes than to force two VL toggles. VTYPE toggles are generally cheaper than VL toggles.
The indirect lowering hinders the outliner's ability to see that
sequences are in fact common, since the sequence similarity is rendered
opaque by the register callee. The size savings from making them
indirect seems to be dwarfed by the outliner's savings from
de-duplication.
rdar://115178034
rdar://115459865
With both +sve and +dotprod, and a scalable vecreduce(sext) we could attempt to
access the number of elements of a scalable vector. Guard against this for now,
until scalable dotprod are properly supported.
Reapply after fixing a clang bug this exposed in D158972 and
adjusting a number of tests that failed for 32-bit targets.
-----
Add a check that the DILocalVariable fragment size in dbg.declare
does not exceed the size of the alloca.
This would have caught the invalid debuginfo regenerated by rustc
in https://github.com/llvm/llvm-project/issues/64149.
Differential Revision: https://reviews.llvm.org/D158743
This matches what is done on calls, since
cc981d285d1aa33df201605b9a3e22dd2311ead2 (extended for another case in
5a751e747dbf2c267e944aa961e21de7a815e7eb).
Apply both those cases on invoke just like is done for call.
Also update the preexisting comment which was left without update in
5a751e747dbf2c267e944aa961e21de7a815e7eb.
This fixes github issue #61941.
LLVM fails to build on 32-bit Solaris/SPARC: several programs fail to
link due to undefined references to `__multi3`. This reference is from
`lib/libLLVMScalarOpts.a(LoopStrengthReduce.cpp.o)`. However, This
function exists neither in the 32-bit `libgcc.a` nor in
`libclang_rt.builtins-sparc.a`. It's only defined in their 64-bit
counterparts.
The same issue affects several 32-bit targets, e.g. 32-bit PowerPC as
described in Issue #54460. The fix is the same: inhibit the libcall for
32-bit compilations. This patch does just that, regenerating the
affected testcases. It allows the build to complete.
Tested on `sparc-sun-solaris2.11`.
Currently clang's medium code model treats all data as large, putting them in a large data section and using more expensive instruction sequences to access them.
Following gcc's -mlarge-data-threshold, which allows putting data under a certain size in a normal data section as opposed to a large data section. This allows using cheaper code sequences to access some portion of data in the binary (which will be implemented in LLVM in a future patch).
And under the medium codel mode, only put data above the large data threshold into large data sections, not all data.
Reviewed By: MaskRay, rnk
Differential Revision: https://reviews.llvm.org/D149288
The ampere1 scheduling model uses IsCheapLSL predicates for ADDXri and
ADDWrr instructions, which only have 3 operands. In attempting to check
that the third is a shift, the predicate can attempt to access an out of
bounds operand, hitting an assert. This splits the rr/ri instructions
(which can never have shifts) from the rs/rx instructions to ensure they
both work correctly. Ampere1Write_1cyc_1AB was chosen for the rr/ir
instructions to match the cheap case.
This also sets CompleteModel = 0 for the ampere1 scheduling model, as at
runtime under debug it will attempt to check that as well as all
instructions having scheduling info, there is information for each
output operand.
DefIdx 1 exceeds machine model writes for
renamable $w9, renamable $w8 = LDPWi renamable $x8, 0
(Try with MCSchedModel.CompleteModel set to false)incomplete machine
model
On some AArch64 cores, including Ampere's ampere1 and ampere1a
architectures, load and store pair instructions are faster compared to
simple loads/stores only when the alignment of the pair is at least
twice that of the individual element being loaded.
Based on that, this patch introduces four new subtarget features, two
for controlling ldp and two for controlling stp, to cover the ampere1
and ampere1a alignment needs and to enable optional fine-grained control
over ldp and stp generation in general. The latter can be utilized by
another cpu, if there are possible benefits
with a different policy than the default provided by the compiler.
More specifically, for each of the ldp and stp respectively we have:
- disable-ldp/disable-stp: Do not emit ldp/stp.
- ldp-aligned-only/stp-aligned-only: Emit ldp/stp only if the source
pointer is aligned to at least double the alignment of the type.
Therefore, for -mcpu=ampere1 and -mcpu=ampere1a
ldp-aligned-only/stp-aligned-only become the defaults, because of the
benefit from the alignment, whereas for the rest of the cpus the default
behaviour of the compiler is maintained.
The stores created when passing operands via memory don't typically
maintain the chain, because they can be done in any order. Instead,
a new chain is created based on all collated stores. SVE parameters
passed via memory don't follow this idiom and try to maintain the
chain, which unfortunately can result in them being incorrectly
deadcoded when the chain is recreated.
This patch brings the SVE side in line with the non-SVE side to
ensure no stores become lost whilst also allowing greater flexibility
when ordering the stores.
We apparently somehow had lowering for the STRICT nodes without any handling
for the normal operations. This makes sure we support the LRINT and LROUND
intrinsics for fp16 when +fullfp16 is not present.