[Recommit of e88ba6d975d887ca001cae30bfa0c53d91165148]
According to the specification in
https://github.com/ARM-software/acle/pull/309 this adds the intrinsics
  void svadd_za16_vg1x2_f16(uint32_t slice, svfloat16x2_t zn)
      __arm_streaming __arm_inout("za");
  void svadd_za16_vg1x4_f16(uint32_t slice, svfloat16x4_t zn)
      __arm_streaming __arm_inout("za");
  void svsub_za16_vg1x2_f16(uint32_t slice, svfloat16x2_t zn)
      __arm_streaming __arm_inout("za");
  void svsub_za16_vg1x4_f16(uint32_t slice, svfloat16x4_t zn)
      __arm_streaming __arm_inout("za");
as well as the corresponding `bf16` variants.
Instead of widening, e.g., an i8 cttz(x) to an i16 cttz(x | 0x100), use
the more optimizable form cttz_zero_undef(x | 0x100), since the widened
operand is known to be non-zero.
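A small Python model (not LLVM code) of why the rewrite is safe: setting bit 8 makes the 16-bit operand non-zero for every i8 input, and it caps the count at 8, which is exactly what an i8 cttz returns for zero. The helper names `cttz`, `cttz_zero_undef` and `widened_cttz_i8` are hypothetical; `cttz_zero_undef` here simply refuses a zero input to stand in for "result undefined".

```python
def cttz(value: int, width: int) -> int:
    """Count trailing zeros of a width-bit value; a zero input yields width."""
    value &= (1 << width) - 1
    if value == 0:
        return width
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def cttz_zero_undef(value: int, width: int) -> int:
    """Same as cttz, but a zero input is not allowed (result would be undefined)."""
    value &= (1 << width) - 1
    if value == 0:
        raise ValueError("cttz_zero_undef called with a zero operand")
    return (value & -value).bit_length() - 1


def widened_cttz_i8(x: int) -> int:
    """Promote an i8 cttz to i16 by OR-ing in bit 8.

    The OR guarantees a non-zero operand, so the zero_undef form never sees 0,
    and for x == 0 the lowest set bit is bit 8, giving the i8 answer of 8.
    """
    return cttz_zero_undef((x & 0xFF) | 0x100, 16)


# Exhaustively check that the widened form matches an i8 cttz.
assert all(widened_cttz_i8(x) == cttz(x, 8) for x in range(256))
```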
See the test file. At head, this crashes with
```
assertion failed at llvm/lib/Support/APInt.cpp:492 in
uint64_t llvm::APInt::extractBitsAsZExtValue(unsigned int, unsigned int) const:
bitPosition < BitWidth && (numBits + bitPosition) <= BitWidth &&
"Illegal bit extraction"
```
After the internal intrinsic 'spv_switch' is processed, we need to delete
the G_BLOCK_ADDR instructions that were generated to keep track of the
corresponding basic blocks. If we simply delete G_BLOCK_ADDR instructions
with BlockAddress operands, their BasicBlock counterparts are left in an
"address taken" state. This makes AsmPrinter generate a series of
unneeded labels of the `"Address of block that was removed by
CodeGen"` kind. This PR ensures that no dangling BlockAddress constants
remain by zapping the BlockAddress nodes first, and only then erasing
the G_BLOCK_ADDR instructions.
See also https://github.com/llvm/llvm-project/pull/87823 for more
details.
This PR fixes generation of argument types of the internal intrinsic
functions `spv_const_composite` and `spv_track_constant`, so that
composite constants of ConstantVector type preserve their correct type
in transformation passes and can later be used successfully by LLVM
intrinsic functions.
The added test case serves two purposes: it checks the fix mentioned
above, and it demonstrates that a call to __builtin_alloca() maps to
instructions from SPV_INTEL_variable_length_array when this extension
is available.
This PR ensures that internal intrinsic functions for PHI operands are
inserted at the correct positions and don't break the rules of
instruction dominance or the grouping of PHI nodes at the top of a basic
block.
On LA64, 32-bit arguments and return values are always sign-extended to
i64, so we need to take this into account for libcalls.
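A minimal Python model (not LLVM code) of the 64-bit register image the rule above implies for a 32-bit value: bits 63..32 are copies of bit 31. The function name `la64_reg_image_i32` is hypothetical; the point is that an input such as 0x80000000 must reach a libcall as 0xFFFFFFFF80000000, not zero-extended.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def la64_reg_image_i32(value: int) -> int:
    """Return the 64-bit GPR contents for a 32-bit argument or return value.

    Per the rule above, the value is sign-extended from bit 31 to i64.
    """
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret as a negative 32-bit quantity
    return value & MASK64  # two's-complement image in a 64-bit register


def zext_image(value: int) -> int:
    """Zero-extended image, shown only for contrast (not what LA64 expects)."""
    return value & MASK32
```

For non-negative 32-bit values the two images coincide, which is why a missing extension around a libcall can go unnoticed until bit 31 is set.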
Reviewed By: heiher, SixWeining
Pull Request: https://github.com/llvm/llvm-project/pull/92375
In late 2021, both Intel and AMD finally documented that every
AVX-capable CPU has always been guaranteed to execute aligned 16-byte
loads/stores atomically, and further, guaranteed that all future CPUs
with AVX will do so as well.
Therefore, we may use normal SSE 128-bit load/store instructions to
implement atomics, if AVX is enabled.
Per the AMD64 Architecture Programmer's Manual, 7.3.2 Access Atomicity:
> Processors that report [AVX] extend the atomicity for cacheable,
> naturally-aligned single loads or stores from a quadword to a double
> quadword.
Per Intel's SDM:
> Processors that enumerate support for Intel(R) AVX guarantee that the
> 16-byte memory operations performed by the following instructions will
> always be carried out atomically:
> - MOVAPD, MOVAPS, and MOVDQA.
> - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
> - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with
> EVEX.128 and k0 (masking disabled).
This was also confirmed to be true for Zhaoxin CPUs with AVX, in
https://gcc.gnu.org/PR104688
I accidentally left out the code to transfer sret attributes to entry
thunks, so values weren't being passed in the right registers, and the
sret pointer wasn't returned in the correct register.
Fixes #90229
This allows us to handle cases where the constant has already been type-legalized behind a bitcast.
Despite the call to ComputeKnownBits, I'm not seeing any notable change in compile time.
The previous logic did not consider whether the architectural features
meet the requirements of the ABI, resulting in the generation of
incorrect object files in some cases. For example:
```
llc -mtriple=loongarch64 -filetype=obj test/CodeGen/LoongArch/ir-instruction/fadd.ll -o t.o
llvm-readelf -h t.o
```
The object file reports the ABI as lp64d; however, the generated code
is lp64s.
The new logic introduces the `feature-implied` ABI. When both target-abi
and triple-implied ABI are invalid, the feature-implied ABI is used.
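A rough Python sketch of the fallback order described above. The helper names and the exact validity rule (an "f" ABI needs the F feature, a "d" ABI needs D, an "s" ABI needs nothing) are assumptions for illustration, not the actual `computeTargetABI` implementation, which also emits warnings.

```python
from typing import Optional


def _abi_prefix(is_64bit: bool) -> str:
    return "lp64" if is_64bit else "ilp32"


def abi_is_valid(abi: Optional[str], is_64bit: bool, has_f: bool, has_d: bool) -> bool:
    """Hypothetical check: do the enabled features satisfy this ABI?"""
    if abi is None:
        return False
    prefix = _abi_prefix(is_64bit)
    if not abi.startswith(prefix):
        return False
    fp_abi = abi[len(prefix):]
    if fp_abi == "s":  # soft-float: no FP registers required
        return True
    if fp_abi == "f":  # single-float: assumed to require F
        return has_f
    if fp_abi == "d":  # double-float: assumed to require D
        return has_d
    return False


def feature_implied_abi(is_64bit: bool, has_f: bool, has_d: bool) -> str:
    """Pick the widest FP ABI the enabled features support (assumption)."""
    prefix = _abi_prefix(is_64bit)
    if has_d:
        return prefix + "d"
    if has_f:
        return prefix + "f"
    return prefix + "s"


def compute_target_abi(target_abi: Optional[str], triple_abi: Optional[str],
                       is_64bit: bool, has_f: bool, has_d: bool) -> str:
    # Prefer an explicit target-abi, then the triple-implied ABI; only when
    # both are invalid fall back to the feature-implied ABI.
    for candidate in (target_abi, triple_abi):
        if abi_is_valid(candidate, is_64bit, has_f, has_d):
            return candidate
    return feature_implied_abi(is_64bit, has_f, has_d)
```

Under these assumptions, a triple-implied lp64d without the F or D features resolves to lp64s, so the ABI recorded in the object file matches the generated code.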
Reviewed By: SixWeining, xen0n
Pull Request: https://github.com/llvm/llvm-project/pull/92223
This is a pre-commit for modifying the `computeTargetABI` logic.
This patch emits a warning when an ABI that has not yet been
standardized is used.
Reviewed By: xen0n, SixWeining
Pull Request: https://github.com/llvm/llvm-project/pull/92222
This patch adds a peephole to AArch64PostSelectOptimize to improve
codegen caused by RegBankSelect restricting both the input and output
registers of G_EXTRACT_VECTOR_ELT to FPR. This can result in a COPY
from FPR to GPR being generated when, for example, the output register
of the G_EXTRACT_VECTOR_ELT is used in a branch condition.
This was noticed when looking at codegen differences between SDAG and GI
for the s1279 kernel in the TSVC benchmark.
They *are* still accepted by the HW but have a conservative effect.
Leave them untouched since handling them would complicate the logic a
bit, and developers who code to such a low level really need to revisit
what they're doing anyway.
As discussed in the last sync-up call, because these profiles are not
yet finalised, they shouldn't be exposed to users unless they opt in to
them (much like experimental extensions). We may later want to add a
more specific flag, but reusing `-menable-experimental-extensions`
solves the immediate problem.
This is implemented by using the new support for marking profiles as
experimental, added in #91993, to move the unratified profiles to
RISCVExperimentalProfile, and by making the necessary changes to the
logic in RISCVISAInfo to handle this.
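A minimal Python sketch of the opt-in gating described above, assuming two separate profile tables as the commit suggests. The profile names and ISA strings are hypothetical placeholders, not real RISC-V profiles, and `expand_profile` is not an LLVM function.

```python
# Hypothetical tables: names and ISA strings are placeholders.
SUPPORTED_PROFILES = {"ratified-profile-a": "rv64imafdc"}
EXPERIMENTAL_PROFILES = {"draft-profile-b": "rv64imafdcv"}


class UnknownProfileError(ValueError):
    pass


def expand_profile(name: str, enable_experimental: bool) -> str:
    """Return the ISA string for a profile, hiding unratified ones by default."""
    if name in SUPPORTED_PROFILES:
        return SUPPORTED_PROFILES[name]
    if name in EXPERIMENTAL_PROFILES:
        # Unratified profiles behave like experimental extensions:
        # they are only usable after an explicit opt-in.
        if not enable_experimental:
            raise UnknownProfileError(
                f"profile '{name}' requires -menable-experimental-extensions")
        return EXPERIMENTAL_PROFILES[name]
    raise UnknownProfileError(f"unsupported profile '{name}'")
```

Keeping the experimental profiles in their own table means ratifying a profile later is just a move between tables, with no change to the lookup logic.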
Before #91440 a VSETVLIInfo would have had an IMPLICIT_DEF defining
instruction, but now we look up a VNInfo which doesn't exist, which
triggers an assertion failure. Mark these undef AVLs as AVLIsIgnored.
By default, EmitCmp avoids cmpw with i16 immediates because the 66/67h length-changing prefixes cause decode stalls; instead, it extends the value to i32 and uses cmpl with an i32 immediate, unless the target has the TuningFastImm16 flag or we're building for optsize/minsize.
However, if the value being compared is loaded from memory, the cost of the decode stalls is likely to be outweighed by the load latency of the folded load, the shorter encoding, and not needing an extra register to hold the extending load.
This matches the behaviour of gcc and msvc.
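The decision described above can be sketched as a small Python function. This is an illustrative model only: the enum, function and parameter names are hypothetical and do not correspond to X86ISelLowering's actual code.

```python
from enum import Enum


class CmpLowering(Enum):
    CMPW_IMM16 = "cmpw with an i16 immediate (66h prefix, possible LCP stall)"
    EXTEND_CMPL_IMM32 = "extend to i32, then cmpl with an i32 immediate"


def lower_i16_cmp_imm(tuning_fast_imm16: bool, opt_for_size: bool,
                      lhs_is_foldable_load: bool) -> CmpLowering:
    # Pre-existing rule: keep cmpw when length-changing prefixes are cheap
    # on this CPU, or when code size matters more than decode speed.
    if tuning_fast_imm16 or opt_for_size:
        return CmpLowering.CMPW_IMM16
    # Rule from this patch: if the compared value comes from a load that can
    # be folded into cmpw, the shorter encoding and saved register are
    # assumed to outweigh the decode stall.
    if lhs_is_foldable_load:
        return CmpLowering.CMPW_IMM16
    return CmpLowering.EXTEND_CMPL_IMM32
```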
Fixes #90355
In a similar vein to #90049, we currently model all of the effects of a
vsetvli pseudo:
* VL and VTYPE are marked as defs
* VL-preserving x0,x0 vsetvlis aren't emitted until
RISCVInsertVSETVLI, and when they are, they have implicit uses on VL
* Regular vector pseudos are fully modelled too: before
RISCVInsertVSETVLI they can be moved between vsetvli pseudos because we
will eventually insert vsetvlis to correct VL and VTYPE. Afterwards,
they will have implicit uses on VL and VTYPE.
Since we model everything, we can remove hasSideEffects=1. This gives us
some improvements, such as sinking in vsetvli-insert-crossbb.ll.
We need to update RISCVDeadRegisterDefinitions to keep handling vsetvli
pseudos since it only operates on instructions with unmodelled side
effects.
Avoid using leading numbers in check prefixes; replace them with the actual triple config names (which also makes it easier to add X64 test coverage in a future commit).
When expanding an L128 (which is used to reload i128), it is
possible that the quadword destination register clobbers an
address register. This patch adds an assertion against the case
where both of the expanded parts clobber the address, and, in the
case where only one of the expanded parts does so, emits that part last.
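A Python model of the ordering rule above, assuming the high doubleword is loaded from offset 0 and the low doubleword from offset 8 (a big-endian layout). Register names such as "r2" and the function `order_l128_halves` are hypothetical; the `ValueError` stands in for the assertion the patch adds.

```python
from typing import List, Set, Tuple


def order_l128_halves(high_reg: str, low_reg: str,
                      addr_regs: Set[str]) -> List[Tuple[str, int]]:
    """Return (destination register, byte offset) loads in emission order."""
    high = (high_reg, 0)  # assumption: high half at offset 0
    low = (low_reg, 8)    # assumption: low half at offset 8
    high_clobbers = high_reg in addr_regs
    low_clobbers = low_reg in addr_regs
    if high_clobbers and low_clobbers:
        # Models the new assertion: no safe order exists.
        raise ValueError("both halves of the L128 clobber the address")
    if high_clobbers:
        # Load the low half first so the address is still intact for it.
        return [low, high]
    # Default order; also correct when only the low half clobbers the address.
    return [high, low]
```

For example, with destination pair ("r2", "r3") and "r2" as the base register, the low half is loaded first and the base register is overwritten last.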
Fixes #91437
Otherwise the middle end will remove the inner vsetvli, and it's more
typical to set the AVL to the remaining VL.
This also prevents the test from showing up as a regression in #91319.
Split off from #70549, this patch moves RISCVInsertVSETVLI to after PHI
elimination, where we exit SSA and need to move to LiveVariables.
The motivation for splitting this off is to avoid the large scheduling
diffs from moving completely to after regalloc, and instead focus on
converting the pass to work on LiveIntervals.
The two main changes required are updating VSETVLIInfo to store VNInfos
instead of MachineInstrs, which allows us to still check for PHI defs in
needVSETVLIPHI, and fixing up the live intervals of any AVL operands
after inserting new instructions.
At O3, the pass is inserted after the register coalescer; otherwise we
end up with a number of COPYs around eliminated PHIs that trip up
needVSETVLIPHI.
Co-authored-by: Piyou Chen <piyou.chen@sifive.com>