The new experimental calling convention preserve_none is the opposite
of the existing preserve_all. It tries to preserve as few general
registers as possible, so all general registers are caller-saved
registers. It can also use more general registers to pass arguments.
This attribute doesn't impact floating-point registers. Floating-point
registers still follow the C calling convention.
Currently preserve_none is supported on X86-64 only. It changes the C
calling convention in the following ways:
* RSP and RBP are the only preserved general registers; all other
general registers are caller-saved registers.
* We can use [RDI, RSI, RDX, RCX, R8, R9, R11, R12, R13, R14, R15, RAX]
to pass arguments.
It can improve the performance of hot tail-call chains, because many
save/restore instructions for callee-saved registers can be removed if
the tail functions use preserve_none. In my experiment with protocol
buffers, the parsing functions improved by 3% to 10%.
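A minimal sketch of the kind of tail-call chain this targets, assuming a
compiler that may or may not know the attribute (hence the
`__has_attribute` guard, which falls back to the default convention).
The `ParseState` struct and the `parse_*` functions are hypothetical and
only illustrate the shape of such a chain; they are not taken from
protocol buffers.

```cpp
#include <cstddef>
#include <cstdint>

// Use preserve_none only where the compiler reports support for it;
// otherwise the macro expands to nothing (assumption: this keeps the
// sketch portable).
#if defined(__has_attribute)
#  if __has_attribute(preserve_none)
#    define PRESERVE_NONE __attribute__((preserve_none))
#  endif
#endif
#ifndef PRESERVE_NONE
#  define PRESERVE_NONE
#endif

// Hypothetical parser state threaded through the chain.
struct ParseState {
  const uint8_t *cur;
  const uint8_t *end;
  uint64_t sum;
};

PRESERVE_NONE uint64_t parse_done(ParseState *s);
PRESERVE_NONE uint64_t parse_byte(ParseState *s);

PRESERVE_NONE uint64_t parse_done(ParseState *s) { return s->sum; }

// Each step ends in a call in tail position. With preserve_none the
// callee has no callee-saved general registers to save and restore.
PRESERVE_NONE uint64_t parse_byte(ParseState *s) {
  if (s->cur == s->end)
    return parse_done(s);
  s->sum += *s->cur++;
  return parse_byte(s);
}

// Ordinary C-convention entry point into the chain.
uint64_t parse_all(const uint8_t *data, size_t len) {
  ParseState s{data, data + len, 0};
  return parse_byte(&s);
}
```

Only the functions inside the chain carry the attribute; the entry
point keeps the normal convention, so callers outside the chain are not
affected.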
This patch implements the v0.8.1 specification. This patch reports
version 0.8 in LLVM since `RISCVISAInfo::ExtensionVersion` only has a
`Major` and `Minor` version number. This patch includes support
of the `Ssnpm`, `Smnpm`, `Smmpm`, `Sspm` and `Supm` extensions that make
up RISC-V pointer masking.
All of these extensions require emitting an attribute containing the
correct `march` string.
The `Ssnpm`, `Smnpm` and `Smmpm` extensions introduce a 2-bit WARL field
(PMM). The extensions do not specify how PMM is set, and therefore this
patch does not need to address this. One example of how it *could* be set is
using the Zicsr instructions to update the PMM bits of the described
registers.
The full specification can be found at
https://github.com/riscv/riscv-j-extension/blob/master/zjpm-spec.pdf
…ndows x86_64
targets that use windows 64 prologue
Windows x86_64 stack frame layout is currently not compatible with
Swift's async extended frame, which reserves the slot right below RBP
(RBP-8) for the async context pointer, because it doesn't account for
the fact that a stack object in a win64 frame can be allocated at the
same location. This can cause issues at runtime. For instance, Swift's
TCA test code has functions that fail because they spill a value to that
stack slot, which then gets overwritten by a store to the address
returned by the @llvm.swift.async.context.addr() intrinsic (which ends
up being RBP - 8), so an incorrect value is used later when that stack
slot is read again. This change drops the use of the async extended
frame for windows x86_64 subtargets and instead uses the x32-based
approach of allocating a separate stack slot for the stored async
context pointer.
Additionally, LLDB, which is the primary consumer of the extended frame,
makes assumptions such as checking for a saved previous frame pointer at
the current frame pointer address. This is also incompatible with the
windows x86_64 frame layout, as the previous frame pointer is not
guaranteed to be stored at the current frame pointer address. Therefore
the extended frame layout can be turned off to fix the current
miscompile without introducing a regression into LLDB for windows
x86_64, as it already doesn't work correctly. I am still investigating
what should be done for LLDB to support using an allocated stack slot to
store the async frame context instead of it being located at RBP - 8 for
windows.
ISel handles filling in x4/x5 when calling variadic functions as they
don't correspond to the 5th/6th X64 arguments but rather to the end of
the shadow space on the stack and the size in bytes of all stack
parameters (ignored and written as 0 for calls from entry thunks).
Will PR a follow up with ISel handling after this is merged.
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register
This will more significantly help improve 'fixup constant' comments from #73509
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register
This will more significantly help improve 'fixup constant' comments from #73509
Correct AMDGPU strictfp tests to follow the rules documented in the
LangRef:
https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics
These tests needed the strictfp attribute added to function calls and
some declarations.
Some of the tests now pass with D146845; others get farther along but
still fail with D146845. The tests revealed that further work is
required, mostly in AMDGPU atomics, to get the tests passing.
Since I was here anyway I removed the strictfp attribute from some
constrained intrinsic declarations. They have this attribute by default.
Test changes verified with D146845.
This enables IR expansion for i128 divisions. The vector case is still
broken because ExpandLargeDivRem doesn't try to handle them.
Fixes: SWDEV-426193
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in the same way SILowerI1Copies selects VReg_1
phis. Note that divergent i1 phis include phis created by LCSSA, and all
cases of uses outside of a cycle are actually covered by "lowering LCSSA
phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.
TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.
patch 3 from: https://github.com/llvm/llvm-project/pull/73337
The SGPR registers used for preserving EXEC mask while lowering the
whole-wave register spills and copies should be preserved at the prolog
and epilog if they are in the CSR range. This doesn't happen when only
a wwm-copy is lowered and there are no wwm-spills. This patch
addresses that problem.
Further develops the vsextload support added in #79815 / b5d35feacb7246573c6a4ab2bddc4919a4228ed5 - reduces the size of the vector constant by storing it in the constant pool in a truncated form, and zero-extending it as part of the load.
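A scalar model of the idea, as a sketch only: a constant whose elements
all fit in a narrower unsigned type can be stored truncated and
zero-extended when loaded. The function names and the fixed 4 x i32 to
4 x i8 shape are assumptions for illustration; the real transform works
on X86 constant-pool vectors and extending vector loads.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Try to shrink a 4 x i32 constant to 4 x i8. This is only legal when
// every element survives truncation followed by zero-extension.
bool truncate_constant(const std::array<uint32_t, 4> &wide,
                       std::array<uint8_t, 4> &narrow) {
  for (std::size_t i = 0; i < wide.size(); ++i) {
    if (wide[i] > 0xFFu)
      return false; // value would change, keep the wide constant
    narrow[i] = static_cast<uint8_t>(wide[i]);
  }
  return true;
}

// Model of the zero-extending load that rebuilds the wide vector.
std::array<uint32_t, 4> zext_load(const std::array<uint8_t, 4> &narrow) {
  std::array<uint32_t, 4> wide{};
  for (std::size_t i = 0; i < narrow.size(); ++i)
    wide[i] = narrow[i]; // implicit zero-extension from uint8_t
  return wide;
}
```

In this model the stored constant shrinks from 16 bytes to 4 bytes
whenever `truncate_constant` succeeds.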
Align the values of the immediate operand of the BRK instruction with those
used by the existing arm64e implementation.
Make AuthCheckMethod::DummyLoad use the requested register
instead of LR.
Introduce Code Object V6 in Clang, LLD, Flang and LLVM. This is the same
as V5 except a new "generic version" flag can be present in EFLAGS. This
is related to new generic targets that'll be added in a follow-up patch.
It's also likely V6 will have new changes (possibly new metadata
entries) added later.
Docs changes are part of the follow-up patch #76955
While working on -riscv-experimental-rv64-legal-i32, I noticed this
missed optimization in our current codegen.
This expands to SADDO/SSUBO+select while still in i32. These will
be type legalized individually.
If we have a shifted mask, we may be able to reduce the load width
to the width of the non-zero part of the mask and use an offset
to the base address to remove the srl. The offset is given by
C+trailingzeros(ShiftedMask).
Then we add a final shl to restore the trailing zero bits.
I've used the ARM test because that's where the existing (and (srl
(load))) tests were.
The X86 test was modified to keep the H register.
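A worked instance of the shifted-mask narrowing described above, as a
sketch under stated assumptions: a little-endian layout, a 32-bit load,
a shift amount C = 8, and ShiftedMask = 0xFF00 (8 trailing zeros, 8-bit
non-zero part). The bit offset C + trailingzeros(ShiftedMask) = 16 then
becomes byte offset 2. The helper names are hypothetical.

```cpp
#include <cstdint>

// Assemble a little-endian 32-bit value from bytes so the sketch does
// not depend on the host's endianness.
static uint32_t load32_le(const uint8_t *p) {
  return uint32_t(p[0]) | uint32_t(p[1]) << 8 | uint32_t(p[2]) << 16 |
         uint32_t(p[3]) << 24;
}

// Original form: (and (srl (load p), 8), 0xFF00)
uint32_t wide_form(const uint8_t *p) {
  return (load32_le(p) >> 8) & 0xFF00u;
}

// Narrowed form: an 8-bit load at byte offset (C + TZ) / 8, followed by
// a shl by TZ to restore the mask's trailing zero bits.
uint32_t narrow_form(const uint8_t *p) {
  constexpr unsigned C = 8;                     // srl amount
  constexpr unsigned TZ = 8;                    // trailing zeros of 0xFF00
  constexpr unsigned ByteOffset = (C + TZ) / 8; // = 2
  return uint32_t(p[ByteOffset]) << TZ;
}
```

For the bytes 0x78 0x56 0x34 0x12 (the value 0x12345678), both forms
yield 0x3400; the narrow form needs no srl and no and.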
Previously we stored MachineInstr which restricted the implementation
to only handle operand 0.
The TH_LWD instruction has two sign extended destinations.
Implement handling of get/set floating point environment for ARM in
Global Instruction Selector. Lowering of these intrinsics to operations
on FPSCR was previously implemented in the DAG selector; in GlobalISel
it is reused.
We had the isel patterns, but no tests that used them. We only had
sextload and zextload tests.
Also reduce the alignment on some of the test cases that were
unnecessarily over aligned.
For the current version of the PR43024 test, we should be able to
optimize away the operations but fail to do so. This commit adds a
strictfp version of the test where we should not be able to optimize
away the operations, to verify that changes improving the non-strictfp
case have no adverse effect on the strictfp one.
This is needed with RV64LegalI32 when the setcc is created after type
legalization. An i1 xor would have been promoted to i32, but the setcc
would have i64 result.
The default lowering will use shifts to make use of an i32 setcc.
We don't support i32 setcc, so it's better to sign extend the low
32 bits and compare the full 64-bit result. This produces
mul+mulw+xor+snez like we do without RV64LegalI32.
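The comparison can be modelled in plain C++ as follows. This is a sketch
that assumes the operation being lowered is a signed 32-bit
multiply-with-overflow check, which the text above implies through the
mul+mulw+xor+snez sequence but does not state outright. The function
name is hypothetical.

```cpp
#include <cstdint>

// Returns true if a * b does not fit in a signed 32-bit integer.
bool smul_overflows_i32(int32_t a, int32_t b) {
  // mul: full 64-bit product of the sign-extended operands.
  int64_t full = int64_t(a) * int64_t(b);
  // mulw: low 32 bits of the product, sign-extended to 64 bits.
  // (The narrowing conversion is modular on mainstream compilers and
  // is defined as such from C++20 on.)
  int64_t low = int64_t(int32_t(uint32_t(uint64_t(full))));
  // xor + snez: the two differ exactly when the product overflowed.
  return (full ^ low) != 0;
}
```

The single 64-bit comparison replaces the shift-based sequence that the
default lowering would build around an i32 setcc.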
Modify the initial implementation (https://reviews.llvm.org/D46745) to
support a constant offset so that the following code will compile:
```
int a[2][2];
void foo() { asm("// %0" :: "S"(&a[1][1])); }
```
We use the generic code path for "s". In GCC's aarch64 port, "S" is
supported for PIC while "s" isn't, making "s" less useful. We implement
"S" but not "s".
Similar to #80201 for RISC-V.