Checks for implicit defs that are unused within a pattern and mark them as dead.
This is done directly at the TableGen level forr efficiency.
The instructions are directly created with the "dead" operand and no further analysis is needed later.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157273
This allows us to use the same canonicalizations as PSHUFD/SHUFPS etc. to avoid unnecessary demanded elts (better splat detection, blend pass through etc.) instead of defaulting to zero mask values.
As noted in D156990, the logic in ExpandIntRes_FP_TO_SINT assumes that
if the type action for the float type is TypeSoftPromoteHalf, is must
have been an f16 (half). However, the meaning of that type action has
been overloaded and it is used for both f16 and bf16. This patch adds an
appropriate check to ensure ISD::FP16_TO_FP or ISD::BF16_TO_FP is
emitted as required.
Differential Revision: https://reviews.llvm.org/D157287
For Thumb-1 Execute-Only, expandLoadStackGuardBase generates a tMOVimm32 pseudo when calculating the stack offset.
It does this in a context where the CSPR maybe be live. tMOVimm32 may corrupt CPSR.
To fix this, generate save/restore CPSR around the tMOVimm32 using MRS/MSR to/from a scratch register.
expandLoadStackGuardBase this runs after register allocation, so the scratch register needs to be a physical register.
Use R12 as a scratch register, as is usual when expanding a pseudo.
MSR/MRS are some of the few v6-M instructions which operate on a high register.
New stack-guard test case added which was generating incorrect code without the save/restore CPSR.
Reviewed By: stuij
Differential Revision: https://reviews.llvm.org/D156968
This happens when CMP1 and CMP3 have the same predicate (or CMP2 and CMP3 have
the same predicate).
This helps optimizations such as the fololowing one:
CMP(A,C)||CMP(B,C) => CMP(MIN/MAX(A,B), C)
CMP(A,C)&&CMP(B,C) => CMP(MIN/MAX(A,B), C)
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156215
As described in [1][2], `-mtune=` is used to select the type of target
microarchitecture, defaults to the value of `-march`. The set of
possible values should be a superset of `-march` values. Currently
possible values of `-march=` and `-mtune=` are `native`, `loongarch64`
and `la464`.
D136146 has supported `-march={loongarch64,la464}` and this patch adds
support for `-march=native` and `-mtune=`.
A new ProcessorModel called `loongarch64` is defined in LoongArch.td
to support `-mtune=loongarch64`.
`llvm::sys::getHostCPUName()` returns `generic` on unknown or future
LoongArch CPUs, e.g. the not yet added `la664`, leading to
`llvm::LoongArch::isValidArchName()` failing to parse the arch name.
In this case, use `loongarch64` as the default arch name for 64-bit
CPUs.
Two preprocessor macros are defined based on user-provided `-march=`
and `-mtune=` options and the defaults.
- __loongarch_arch
- __loongarch_tune
Note that, to work with `-fno-integrated-cc1` we leverage cc1 options
`-target-cpu` and `-tune-cpu` to pass driver options `-march=` and
`-mtune=` respectively because cc1 needs these information to define
macros in `LoongArchTargetInfo::getTargetDefines`.
[1]: https://github.com/loongson/LoongArch-Documentation/blob/2023.04.20/docs/LoongArch-toolchain-conventions-EN.adoc
[2]: https://github.com/loongson/la-softdev-convention/blob/v0.1/la-softdev-convention.adoc
Reviewed By: xen0n, wangleiat, steven_wu, MaskRay
Differential Revision: https://reviews.llvm.org/D155824
Prevents regression on some future work to improve codegen for concat_vectors(extract_subvector(),extract_subvector()) patterns.
X86ISD::SHUF128 optimization is still pretty poor (especially the zmm variant), not optimizing the shuffle demanded elts like we do for SHUFPS.
Otherwise device libs still has issues at O0 (in OpenCL-CTS)
Depends on D156972 as well. They're unrelated fixes but both are needed to fix the issue.
Fixes SWDEV-402331
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D156973
Idx's type can be different from Ptr's, causing a "Binary operator types must match" assertion failure when emitting the MUL.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D156972
As there is no direct bf16 libcall for these conversions, extend to f32
first.
This patch includes a tiny refactoring to pull out equivalent logic in
ExpandIntRes_XROUND_XRINT so it can be reused in
ExpandIntRes_FP_TO_{S,U}INT.
This patch also demonstrates incorrect codegen for RV32 without zfbfmin
for the newly enabled tests. As it doesn't introduce that incorrect
codegen (caused by the assumption that 'TypeSoftPromoteHalf' is only
used for f16 types), a fix will be added in a follow-up (D157287).
Differential Revision: https://reviews.llvm.org/D156990
This patch added checks for global entries in ReplaceWithVeclib testing
using ArmPL and SLEEF vector libraries.
Differential Revision: https://reviews.llvm.org/D157258
Lower to the strided/contiguous addressing mode of
ld1/ldnt1 instructions depending on register allocation.
Differential Revision: https://reviews.llvm.org/D156311
This helps simplify constant splats a little. Without this the code in
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L14072 always returns the
existing node.
Differential Revision: https://reviews.llvm.org/D157259
Pulled out of LowerTruncateVecPackWithSignBits - prefer shuffles unless we can cheaply split the vector. ComputeNumSignBits struggles with vXi64 through bitcasts, so we're usually better off with shuffles.
This reuses the same strategy for fixed vectors as other ops, i.e. custom lower
to a scalable *_vl SD node.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D157294
We have a variant of this for splats already, but hadn't handled the case where a single copy of the wider element can be inserted producing the entire required bit pattern. This shows up mostly in very small vector shuffle tests.
Differential Revision: https://reviews.llvm.org/D157299
Test legalization for (i7, i8, i16, i32, i48, i64) on rv32 and for (i8, i15, i16, i32, i64, i72, i128). Legalization fails for i96 on rv32 and i192 on rv64. Note that [i192 fails for AArch64](https://github.com/llvm/llvm-project/issues/64394).
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D157023
Legalize G_ADD, G_SUB, G_(S/U)ADD(O/E). We test for (s7, s48, s64, s96)
on rv32 and (s15, s72, s128, s192) on rv64.
Differential Revision: https://reviews.llvm.org/D157019
Legalize G_AND, G_OR, G_XOR for (s7, s48) on rv32 and (s15, s72) on rv64
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D157017
Legalize G_SHL, G_ASHR and G_LSHR for types narrower and upto (and including) XLen: (i7, i8,
i16 and i32) for rv32 and (i8, i15, i16, i32 and i64) for rv64. This requires
adding some rules to handle G_ANYEXT, G_ZEXT and G_SEXT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155772
This avoids narrowing after it has been expanded to shifts. The
G_SEXT_INREG narrowing can use the second operand of the instruction to
optimize the narrowing.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157172
There are cases where the -1 doesn't become visible until lowering
so the folding doesn't have a chance to run.
I think in these cases there is a missed DAGCombine for truncate (undef),
which I may fix separately, but RISC-V backend should protect itself.
Fixes#64503.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D157314
Currently when a stack access is out of range of an sp-relative ldr or
str then we jump straight to generating the offset with a literal pool
load or mov32 pseudo-instruction. This patch improves that in two
ways:
* If the offset is within range of sp-relative add plus an ldr then
use that.
* When we use the mov32 pseudo-instruction, if putting part of the
offset into the ldr will simplify the expansion of the mov32 then
do so.
Differential Revision: https://reviews.llvm.org/D156875
If we have a dominant value, we can still use a v(f)slide1down to handle the last value in the vector if that value is neither undef nor the dominant value.
Note that we can extend this idea to any tail of elements, but that's ends up being a near complete merge of the v(f)slide1down insert path, and requires a bit more untangling on profitability heuristics first.
Differential Revision: https://reviews.llvm.org/D157120
A broadcast base pointer is the same as a scalar base pointer for GEP semantics (when there's at least one other vector operand). This is the form that SLP likes to emit, so we should handle it.
Differential Revision: https://reviews.llvm.org/D157132
Change the scheduler's physical register dependency tracking from
registers-and-their-aliases to regunits. This has a couple of advantages
when subregisters are used:
- The dependency tracking is more accurate and creates fewer useless
edges in the dependency graph. An AMDGPU example, edited for clarity:
SU(0): $vgpr1 = V_MOV_B32 $sgpr0
SU(1): $vgpr1 = V_ADDC_U32 0, $vgpr1
SU(2): $vgpr0_vgpr1 = FLAT_LOAD_DWORDX2 $vgpr0_vgpr1, 0, 0
There is a data dependency on $vgpr1 from SU(0) to SU(1) and from
SU(1) to SU(2). But the old dependency tracking code also added a
useless edge from SU(0) to SU(2) because it thought that SU(0)'s def
of $vgpr1 aliased with SU(2)'s use of $vgpr0_vgpr1.
- On targets like AMDGPU that make heavy use of subregisters, each
register can have a huge number of aliases - it can be quadratic in
the size of the largest defined register tuple. There is a much lower
bound on the number of regunits per register, so iterating over
regunits is faster than iterating over aliases.
The LLVM compile-time tracker shows a tiny overall improvement of 0.03%
on X86. I expect a larger compile-time improvement on targets like
AMDGPU.
Recommit after fixing AggressiveAntiDepBreaker in D156880.
Differential Revision: https://reviews.llvm.org/D156552
This patch tweaks the fix in D20627 "Do not rename registers that do not
start an independent live range" to only consider Data dependencies, not
Output or Anti dependencies. An Output or Anti dependency to a superreg
does not imply that that superreg is live at the current instruction.
This enables breaking anti-dependencies in a few more cases as shown by
the lit test updates.
Differential Revision: https://reviews.llvm.org/D156879
Begin to consolidate the similar matching code we have - all have semi-similar constraints that still need merging together to ensure we get consistent codegen depending on when the truncate is lowered.