AArch64 introduced CMLA and CADD instructions as part of SVE2. This
change allows generating such instructions when this architecture
feature is available.
Differential Revision: https://reviews.llvm.org/D153808
This is a reduced version of one of the tests that was broken by the
original commit of D154281 "[CodeGen] Store SP adjustment in
MachineBasicBlock. NFCI.".
Differential Revision: https://reviews.llvm.org/D155471
If the source operand is already all-signbits we don't need to create the sign extended elements - just splat the source element to the destination element width
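To illustrate why this holds, here is a minimal Python sketch (not the actual lowering code; `sign_extend` and `splat` are hypothetical helper names). An all-signbits element is either 0 or -1, so sign extending it yields the same bit pattern as repeating the narrow element across the wider one.

```python
def sign_extend(value, src_bits, dst_bits):
    """Sign extend a src_bits-wide pattern to dst_bits (result as unsigned pattern)."""
    sign = 1 << (src_bits - 1)
    signed = (value ^ sign) - sign          # reinterpret the pattern as signed
    return signed & ((1 << dst_bits) - 1)


def splat(value, src_bits, dst_bits):
    """Repeat the src_bits-wide pattern until it fills dst_bits."""
    out = 0
    for i in range(dst_bits // src_bits):
        out |= value << (i * src_bits)
    return out


# All-signbits inputs (0x00 and 0xFF for 8-bit elements): both forms agree.
for v in (0x00, 0xFF):
    assert sign_extend(v, 8, 32) == splat(v, 8, 32)

# A value that is not all-signbits shows why the precondition is needed.
assert sign_extend(0x80, 8, 32) != splat(0x80, 8, 32)
```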
Currently our AVRShiftExpand pass expands only 32-bit shifts, with the
assumption that other kinds of shifts (e.g. 64-bit ones) are
automatically reduced to 8-bit ones by LLVM during ISel.
However this is not always true and causes problems in the rust-lang runtime.
This commit changes the logic a bit, so that instead of expanding only
32-bit shifts, we expand shifts of all types except 8-bit and 16-bit.
This is not the optimal solution, because 64-bit shifts could instead be
expanded to 32-bit shifts, which have been heavily optimized.
I've checked the generated code using rustc + simavr, and all shifts
seem to behave correctly.
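For reference, the alternative mentioned above (splitting a 64-bit shift into 32-bit pieces) could look roughly like the following Python sketch. This is not what this commit implements, and `shl64_via_32` is a hypothetical name used only for illustration.

```python
MASK32 = 0xFFFFFFFF


def shl64_via_32(lo, hi, amount):
    """Shift the 64-bit value hi:lo left by amount (0..63) using 32-bit halves.

    Returns the new (lo, hi) pair.
    """
    assert 0 <= amount < 64
    if amount == 0:
        return lo, hi
    if amount >= 32:
        # The low half moves entirely into the high half.
        return 0, (lo << (amount - 32)) & MASK32
    # Bits shifted out of the low half carry into the high half.
    new_hi = ((hi << amount) | (lo >> (32 - amount))) & MASK32
    new_lo = (lo << amount) & MASK32
    return new_lo, new_hi
```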
Spotted in the wild in rustc:
https://github.com/rust-lang/compiler-builtins/issues/523
https://github.com/rust-lang/rust/issues/112140
Reviewed By: benshi001
Differential Revision: https://reviews.llvm.org/D154785
This patch fixes a test failure triggered by verifyInstructionPredicates: JMPk is added without checking the target predicate.
Reviewed By: benshi001
Differential Revision: https://reviews.llvm.org/D155570
The LA464 micro-architecture is very sensitive to alignment of hot code,
with performance variation of up to ~12% in the go1 benchmark suite of
the Go language (as observed by me during my work on the Go loong64
port).
[[ https://go.dev/cl/479816 | Manual alignment of certain loops ]] and [[ https://go.dev/cl/479817 | automatic alignment of loop heads ]]
help a lot there, by reducing much of the random variation and
generally increasing performance, so we naturally want to do the same
here.
Practically, LA464 is the only LoongArch micro-architecture in wide use,
and we are currently supporting just that. The first "4" in "LA464"
stands for "4-issue", in particular its instruction fetch and decode
stages are 4-wide; so functions and branch targets should preferably be
aligned to at least 16 bytes for best throughput.
The Loongson team has benchmarked various combinations of function,
loop, and branch target alignments with GCC.
[[ https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619980.html | The results ]]
show that "16-byte label alignment together with 32-byte function
alignment gives best results in terms of SPEC score". A "label" in GCC
means a branch target; while we don't currently align branch targets,
we do align loops, so in this patch we default to 32-byte function
alignment and 16-byte loop alignment.
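The 16-byte figure follows from 4 instructions of 4 bytes each. The Python sketch below works through that arithmetic. It assumes, purely for illustration, that the front end fetches aligned 16-byte windows; `align_up` and `fetch_groups` are hypothetical helpers, not LLVM APIs.

```python
INSN_BYTES = 4                            # LoongArch instructions are a fixed 32 bits
FETCH_WIDTH = 4                           # "4-wide" fetch and decode, as described above
FETCH_BYTES = INSN_BYTES * FETCH_WIDTH    # 16 bytes per fetch group


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def fetch_groups(start, n_insns):
    """Count the aligned 16-byte windows touched by n_insns instructions at start.

    Assumes fetch windows are aligned to FETCH_BYTES (an illustrative assumption).
    """
    end = start + n_insns * INSN_BYTES - 1
    return end // FETCH_BYTES - start // FETCH_BYTES + 1
```

Under that assumption, a 4-instruction loop body starting at 0x1008 straddles two windows, while the same body at `align_up(0x1008, 16)` fits in one.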
Reviewed By: SixWeining
Differential Revision: https://reviews.llvm.org/D148622
In D155502, we added code for the compiler to check GPR-s for f16
under zhinx. This commit adds code to hit the stack when we run out of
GPR-s.
Together with D155502, this patch resolves #63922.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155507
I was attempting to add llvm.reduce.fminimum/fmaximum support for GlobalISel.
In the process I noticed that llvm.reduce.fmin/fmax was missing, and could do
with being added first. That led on to adding additional vector support for
minnum/maxnum, which in turn led to needing to handle fptrunc and fpext for
some of the fp16 types. So this patch extends the vector handling for fptrunc,
adding support for f16 types which are clamped to 4 elements, and scalarizing
the rest.
I went round in circles a little with how smaller than legal vectors should be
handled, but this seems simple and seems to work, if not always optimally yet.
Differential Revision: https://reviews.llvm.org/D155311
Instead of zero extending the inputs by masking, we can shift them
left. This is cheaper when we don't have the zext.w instruction.
This does make the case where the inputs are already zero extended
or freely zero extendable worse though.
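The message above does not name the operation involved. As one possible illustration, assume it is an unsigned 32x32 multiply on RV64 whose full 64-bit product is wanted (an assumption, not taken from the patch). The Python sketch below shows that shifting both inputs left by 32 and taking the high half of the product matches masking them first; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def mulhu(a, b):
    """Upper 64 bits of the unsigned 64x64 product (models RISC-V mulhu)."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def product_by_masking(x, y):
    """Zero extend both 32-bit inputs with a mask, then multiply."""
    return ((x & MASK32) * (y & MASK32)) & MASK64


def product_by_shifting(x, y):
    """Shift both inputs left by 32; the upper register bits fall off the top.

    (x * 2**32) * (y * 2**32) == (x * y) * 2**64, so the product of the
    low 32-bit values ends up in the high 64 bits of the 128-bit result.
    """
    return mulhu((x << 32) & MASK64, (y << 32) & MASK64)
```

Both shifts are always required in the second form, which is why inputs that are already zero extended come out worse.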
Reviewed By: wangpc
Differential Revision: https://reviews.llvm.org/D155530
This makes Zicond and XVentanaCondOps use the same code path.
The instructions have identical semantics.
Reviewed By: wangpc
Differential Revision: https://reviews.llvm.org/D155391
Resolves #63917.
Also lets the compiler check for available GPR before hitting the stack.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D155502
The vmv1r.v v8, v9 in the last block can be removed by late
copy propagation.
Reviewed By: wangpc
Differential Revision: https://reviews.llvm.org/D155527
This patch optimizes a pair of LDRSWpre and LDRSWui (or LDURSWi)
instructions into a single LDPSWpre instruction. This is a missing case
in D99272.
MIR test cases in D152564 are updated to verify the optimization.
Differential Revision: https://reviews.llvm.org/D152407
This patch adds MIR test cases that test merging an LDRSWpre-LDR
instruction pair into an LDPSWpre instruction. This optimization is
currently missing and will be added in a subsequent patch (D152407), so
none of the test cases are merged for now.
Differential Revision: https://reviews.llvm.org/D152564
We currently don't extract vector elements from multi-use build vectors unless TLI.aggressivelyPreferBuildVectorSources accepts them, which seems a little extreme for constant build vectors (especially as in some cases ComputeKnownBits will indirectly extract the data for us).
This is causing a few regressions in some upcoming SimplifyDemandedBits work I'm looking at, all of which just need to know that the element is zero, so I've tweaked the fold to accept zero elements as well, which will typically fold very easily.
Differential Revision: https://reviews.llvm.org/D155582
This reverts commit b1d0bc0f4395c69097bc11b6ba8f821f621272a9.
Builds with expensive checks show that 'sp' isn't a valid register
in ADDXrr - an object file built without expensive checks enabled
disassembles as "add x15, xzr, x16", instead of the intended
"add x15, sp, x16".
In most places where TransferImpOps is currently used we just have one
machine instruction, so it's doing the same thing as copyImplicitOps
anyway. In those cases where we have more than one machine
instruction the destination is written to in each instruction so any
implicit defs should appear on all of them (and we shouldn't see any
implicit refs as these pseudo-instructions don't have any register
inputs), meaning the current use of TransferImpOps is incorrect and
we should be using copyImplicitOps on all of the generated
instructions.
Differential Revision: https://reviews.llvm.org/D155301
Unfortunately we can't use the standard splat_vector and vnot PatFrags because
they are preprocessed to vmv.v.x's, so we need to define helpers to catch
those. We can't use SplatPat either because we need to nest another fragment
inside of it.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155433
We were dropping the flags and thus blocking contract into potential
fadd users. GlobalISel was already preserving the flags here.
https://reviews.llvm.org/D155443
Before this patch, the only way to generate streaming-compatible code
was to use the `-force-streaming-compatible-sve` flag, but the compiler
should also avoid the use of instructions invalid in streaming mode
when a function has the aarch64_pstate_sm_enabled/compatible attribute.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D155428
This fixes sinking a VGPR def out of a loop past the reconvergence
point at the SI_END_CF. There was a prior fix which introduced
blockPrologueInterferes (D121277) to fix the same basic problem for
the post RA sink. This also had the isIgnorableUse special case,
which was incorrect, because in some contexts the exec use is not
ignorable.
I'm thinking about a new way to represent this which will avoid
needing hasIgnorableUse and isBasicBlockPrologue, which would function
more like the exception handling.
Fixes: SWDEV-407790
https://reviews.llvm.org/D155343
D111904 and D141585 made RISC-V custom-lower vector ISD::CTLZ_ZERO_UNDEF/CTTZ_ZERO_UNDEF/CTLZ
by converting to float and using the float result.
Perhaps VP_CTLZ_ZERO_UNDEF/VP_CTTZ_ZERO_UNDEF/VP_CTLZ could use a similar approach.
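The general idea behind the float-based approach can be sketched in Python as follows. This is a scalar illustration only, not the instruction sequence emitted by the RISC-V lowering. The helper names are hypothetical, and a double is used so that 32-bit inputs convert exactly.

```python
import struct


def float_exponent(x):
    """Unbiased exponent of x after conversion to an IEEE-754 double."""
    bits = struct.unpack("<Q", struct.pack("<d", float(x)))[0]
    return ((bits >> 52) & 0x7FF) - 1023


def ctlz_zero_undef(x, width=32):
    """Count leading zeros of a nonzero width-bit value via the float exponent."""
    assert 0 < x < (1 << width) and width <= 53
    return (width - 1) - float_exponent(x)


def cttz_zero_undef(x, width=32):
    """Count trailing zeros by isolating the lowest set bit first."""
    assert 0 < x < (1 << width) and width <= 53
    return float_exponent(x & -x)
```

The exponent of the converted value is the index of the highest set bit, so both counts fall out of a single int-to-float conversion. Zero input is excluded, matching the ZERO_UNDEF variants.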
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155150
CMP(A,C)||CMP(B,C) => CMP(MIN/MAX(A,B), C)
CMP(A,C)&&CMP(B,C) => CMP(MIN/MAX(A,B), C)
This first patch handles integer types.
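Whether MIN or MAX is used depends on the predicate and on the logic op. The following Python sketch (an exhaustive check over a small integer range, not the DAG combine itself; `check_fold` is a hypothetical name) spells out the four strict-inequality cases.

```python
from itertools import product


def check_fold(values=range(-4, 5)):
    """Verify the compare-merging identities for every (a, b, c) in values."""
    for a, b, c in product(values, repeat=3):
        # OR of "less than": true if the smaller operand is below c.
        assert ((a < c) or (b < c)) == (min(a, b) < c)
        # AND of "less than": true only if the larger operand is below c.
        assert ((a < c) and (b < c)) == (max(a, b) < c)
        # OR of "greater than": true if the larger operand is above c.
        assert ((a > c) or (b > c)) == (max(a, b) > c)
        # AND of "greater than": true only if the smaller operand is above c.
        assert ((a > c) and (b > c)) == (min(a, b) > c)
    return True
```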
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D153502
This is a follow up to D149722 and aims to address https://github.com/llvm/llvm-project/issues/63885.
Local-exec accesses were not previously accounted for in XCOFFObjectWriter.
Specifically, the R_TLS_LE relocation was not previously handled, which led to
the incorrect value being written for the relocation target.
Within this patch, the value being written is set to the symbol's virtual
address and extra relocation tests are added.
Differential Revision: https://reviews.llvm.org/D155415
Previously we returned i32 on RV32 and i64 on RV64. The instructions
only consume 32 bits and only produce 32 bits. For RV64, the result
is sign extended to 64 bits like *W instructions.
This patch removes this detail from the interface to improve
portability and consistency. This matches the proposal for scalar
intrinsics here https://github.com/riscv-non-isa/riscv-c-api-doc/pull/44
I've included IR autoupgrade support as well.
I'll be doing this for other builtins/intrinsics that currently use
'long' in other patches.
Reviewed By: VincentWu
Differential Revision: https://reviews.llvm.org/D154647