With range and undef metadata on a call we can have vector AssertZExt
generated on a target with no vector operations. The AssertZExt needs to
scalarize to a normal `AssertZext tin, ValueType`. I have added
AssertSext too, although I do not have a test case.
Fixes#110374
This patch make a couple of improvements to ReduceLoadOpStoreWidth.
When determining the minimum size of "NewBW" we now take byte boundaries
into account. If we for example touch bits 6-10 we shouldn't accept
NewBW=8, because we would fail later when detecting that we can't access
bits from two different bytes in memory using a single load. Instead we
make sure to align LSB/MSB according to byte size boundaries up front
before searching for a viable "NewBW".
In the past we only tried to find a "ShAmt" that was a multiple of
"NewBW", but now we use a sliding window technique to scan for a viable
"ShAmt" that is a multiple of the byte size. This can help out finding
more opportunities for optimization (specially if the original type
isn't byte sized, and for big-endian targets when the original
load/store is aligned on the most significant bit).
For pre-ra scheduling, we use two options `-misched-topdown` and
`-misched-bottomup` to force the direction.
While for post-ra scheduling, we use `-misched-postra-direction`
with enumerated values (`topdown`, `bottomup` and `bidirectional`).
This is not unified and adds some mental burdens. Here we replace
these two options `-misched-topdown` and `-misched-bottomup` with
`-misched-prera-direction` with the same enumerated values.
To avoid the condition of `getNumOccurrences() > 0`, we add a new
enum value `Unspecified` and make it the default initial value.
These options are hidden, so we needn't keep the compatibility.
DAGCombiner::ReduceLoadOpStoreWidth could replace memory accesses
with more narrow loads/store, although sometimes the new load/store
would touch memory outside the original object. That seemed wrong
and this patch is simply avoiding doing the DAG combine in such
situations.
Also simplifying the expression used to align ShAmt down to a multiple
of NewBW. Subtracting (ShAmt % NewBW) should do the same thing as the
old more complicated expression.
Intention is to follow up with a patch that make more attempts, trying
to align the memory accesses at other offsets, allowing to trigger
the transform in more situations. The current strategy for deciding
size (NewBW) and offset (ShAmt) for the narrowed operations are a bit
ad-hoc, and not really considering big endian memory order in same
way as little endian.
Adding test cases related to narrowing of load-op-store sequences.
ReduceLoadOpStoreWidth isn't careful enough, so it may end up
creating load/store operations that access memory outside the region
touched by the original load/store. Using ARM as a target for the
test cases to show what happens for both little-endian and big-endian.
This patch also adds a way to override the TLI.isNarrowingProfitable
check in DAGCombiner::ReduceLoadOpStoreWidth by using the option
-combiner-reduce-load-op-store-width-force-narrowing-profitable.
Idea is that it should be simpler to for example add lit tests
verifying that the code is correct for big-endian (which otherwise
is difficult since there are no in-tree big-endian targets that
is overriding TLI.isNarrowingProfitable).
This is a pre-commit for
https://github.com/llvm/llvm-project/pull/119203
Re-landing #116970 after fixing miscompilation error.
The original change made it possible for CMPZ to have multiple uses;
`ARMDAGToDAGISel::SelectCMPZ` was not prepared for this.
Pull Request: https://github.com/llvm/llvm-project/pull/118887
Original commit message:
Following #116547 and #116676, this PR changes the type of results and
operands of some nodes to accept / return a normal type instead of Glue.
Unfortunately, changing the result type of one node requires changing
the operand types of all potential consumer nodes, which in turn
requires changing the result types of all other possible producer nodes.
So this is a bulk change.
This DAG combine was incorrect for big-endian targets, because it
assumes that when a bitcast changes the lane width, the
least-significant bits of the wider lanes are in the lower-numbered
lanes of the smaller type, which is only true for little-endian.
Following #116547 and #116676, this PR changes the type of results and
operands of some nodes to accept / return a normal type instead of Glue.
Unfortunately, changing the result type of one node requires changing
the operand types of all potential consumer nodes, which in turn
requires changing the result types of all other possible producer nodes.
So this is a bulk change.
Pull Request: https://github.com/llvm/llvm-project/pull/116970
After 9fe78db4, the pass inserts `store volatile i32 -1, ptr %call_site`
before all invoke instruction except the one in the entry block, which
has the effect of bypassing landing pads on exceptions.
When configuring the call site for a potentially throwing instruction
check that it is not `InvokeInst` -- they are handled by earlier code.
Following #116547, this changes the result of `ARMISD::CMPFP*` and the
operand of `ARMISD::FMSTAT` from a special `Glue` type to a normal type.
This change allows comparisons to be CSEd and scheduled around as can be
seen in the test changes.
Note that `ARMISD::FMSTAT` is still glued to its consumer nodes; this is
going to be changed in a separate patch.
This patch also sets `CopyCost` of `cl_FPSCR_NZCV` register class to a
negative value. The reason is the same as for CCR register class: it
makes DAG scheduler and InstrEmitter try to avoid copies of `FPCSR_NZCV`
register to / from virtual registers. Previously, this was not
necessary, since no attempt was made to create copies in the first
place.
`TRI::getCrossCopyRegClass` is modified in a way that prevents DAG
scheduler from copying FPSCR into a virtual register. The register
allocator might need to spill the virtual register, but that only seem
to work in Thumb mode.
Following #116547, this changes the result of `ARMISD::CMPFP*` and the
operand of `ARMISD::FMSTAT` from a special `Glue` type to a normal type.
This change allows comparisons to be CSEd and scheduled around as can be
seen in the test changes.
Note that `ARMISD::FMSTAT` is still glued to its consumer nodes; this is
going to be changed in a separate patch.
This patch also sets `CopyCost` of `cl_FPSCR_NZCV` register class to a
negative value. The reason is the same as for CCR register class: it
makes DAG scheduler and InstrEmitter try to avoid copies of `FPCSR_NZCV`
register to / from virtual registers. Previously, this was not
necessary, since no attempt was made to create copies in the first
place.
There might be a case when a copy can't be avoided (although not found
in existing tests). If a copy is necessary, the virtual register will be
created with `cl_FPSCR_NZCV` register class. If this register class is
inappropriate, `TRI::getCrossCopyRegClass` should be modified to return
the correct class.
Pull Request: https://github.com/llvm/llvm-project/pull/116676
1. When two (or more) nodes are glued, DAG scheduler will always
schedule them as one piece, i.e. it will not allow any instructions to
be scheduled between them. It does so because if nodes are glued this
usually means that there is an implicit register dependency between
them, and an intervening node could clobber this physical register. When
emitting such nodes into machine IR, they will also be stuck together,
e.g.:
```
%9:gpr = MOVsrl_glue killed %8, implicit-def $cpsr
%10:gpr = RRX %3, implicit $cpsr
```
2. If a node has Glue result, SelectionDAG will not try to CSE this
node. If it did, it would break the implicit physical register
dependency. In practice this means that if a node with Glue result has
multiple uses, it has to be duplicated before each use. This the reason
for `ARMTargetLowering::duplicateCmp` to exist.
When using normal data dependency, dependent nodes can freely be
scheduled around. If there is a physical register dependency between
nodes, the physical register will be copied to/from a virtual register,
allowing other nodes to intervene between them. The resulting machine IR
might look like this:
```
%9:gpr = LSRs1 killed %8, implicit-def $cpsr
%10:gpr = COPY $cpsr
%11:gpr = ORRrsi killed %9, %3, 242, 14 /* CC::al */, $noreg, $noreg
%12:gpr = BICri killed %11, -2147483648, 14 /* CC::al */, $noreg, $noreg
$cpsr = COPY %10
%13:gpr = RRX %3, implicit $cpsr
```
The two copies are likely to be eliminated by register coalescer, given
that there are no instructions between them that clobber this physical
register. If the copies are unwanted in the first place (they could be
expensive or impossible), DAG scheduler will try to avoid inserting them
wherever possible, and the resulting machine IR will look like this:
```
%9:gpr = LSRs1 killed %8, implicit-def $cpsr
%10:gpr = ORRrsi killed %9, %3, 242, 14 /* CC::al */, $noreg, $noreg
%11:gpr = BICri killed %10, -2147483648, 14 /* CC::al */, $noreg, $noreg
%12:gpr = RRX %3, implicit $cpsr
```
On ARM, arithmetic operations and LSLS already use the new data flow
approach. This patch extends it to include 1-bit shifts.
Pull Request: https://github.com/llvm/llvm-project/pull/116547
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- `Builtins.td` - Add f16 support for libm atan2 builtin
- `CGBuiltin.cpp` - Emit constraint atan2 intrinsic for clang builtin
- `clang/test/CodeGenCXX/builtin-calling-conv.cpp` - Use erff instead of
atan2 for clang builtin to lib call calling convention check, now that
atan2 maps to an intrinsic.
- add atan2 cases to llvm.experimental.constrained tests for more
backends: ARM, PowerPC, RISCV, SystemZ.
- LangRef.rst: add llvm.experimental.constrained.atan2, revise
llvm.atan2 description.
Last part of Implement the atan2 HLSL Function. Fixes#70096.
This reverts commit c31014322c0b5ae596da129cbb844fb2198b4ef4.
Based on the discussions in #112772, this pass is not needed after the
introduction of `llvm.threadlocal.address` intrinsic.
Fixes https://github.com/llvm/llvm-project/issues/112771.
... matching what we have in the disassembler. This isn't turned on by
default since several of the scheduling models are not completely
accurate, and we don't want to be misleading.
If we're masking the LSB of a SRL node result and that is shifting down an extended sign bit, see if we can change the SRL to shift down the MSB directly.
These patterns can occur during legalisation when we've sign extended to a wider type but the SRL is still shifting from the subreg.
Alternative to #114967
Fixes the remaining regression in #112588
When doing a call from CMSE secure state to non-secure state for
v8-M.main, we use the VLLDM and VLSTM instructions to save, clear and
restore the FP registers around the call. These instructions both check
the CONTROL_S.SFPA bit, and if it is clear (meaning the current contents
of the FP registers are not secret) they execute as no-ops.
This causes a problem when CONTROL_S.SFPA==0 before the call, which
happens if there are no floating-point instructions executed between
entry to secure state and the call. If this is the case, then the VLSTM
instruction will do nothing, leaving the save area in the stack
uninitialised. If the called function returns a value in floating-point
registers, the call sequence includes an instruction to copy the return
value from a floating-point register to a GPR, which must be before the
VLLDM instruction. This copy sets CONTROL_S.SFPA, meaning that the VLLDM
will fully execute, and load the uninitialised stack memory into the FP
registers.
This causes two problems:
* The FP register file is clobbered, including all of the callee-saved
registers, which might contain live values.
* The stack region might contain secret values, which will be leaked to
non-secure state through the floating-point registers if/when we
return to non-secure state.
The fix is to insert a `vmov s0, s0` instruction before the VLSTM
instruction, to ensure that CONTROL_S.SFPA is set for both the VLLDM and
VLSTM instruction.
CVE: https://www.cve.org/cverecord?id=CVE-2024-7883
Security bulletin:
https://developer.arm.com/Arm%20Security%20Center/Cortex-M%20Security%20Extensions%20Vulnerability
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
Some tests contain errors in constrained intrinsic usage, such as missed
or extra type parameters, wrong type parameters order and some other.
---------
Co-authored-by: Andy Kaylor <andy_kaylor@yahoo.com>
We don't need to copy byval arguments to tail calls via a temporary, if
we can prove that we are not copying from the outgoing argument area.
This patch does this when the source if the argument is one of:
* Memory in the local stack frame, which can't be used for tail-call
arguments.
* A global variable.
We can also avoid doing the copy completely if the source and
destination are the same memory location, which is the case when the
caller and callee have the same signature, and pass some arguments
through unmodified.
When passing byval arguments to tail-calls, we need to store them into
the stack memory in which this the caller received it's arguments. If
any of the outgoing arguments are forwarded from incoming byval
arguments, then the source of the copy is from the same stack memory.
This can result in the copy corrupting a value which is still to be
read.
The fix is to first make a copy of the outgoing byval arguments in local
stack space, and then copy them to their final location. This fixes the
correctness issue, but results in extra copying, which could be
optimised.
Byval arguments which are passed partially in registers get stored into
the local stack frame, but it is valid to tail-call them because the
part which gets spilled is always re-loaded into registers before doing
the tail-call, so it's OK for the spill area to be deallocated.
The ARM backend was checking that the outgoing values for a tail-call
matched the incoming argument values of the caller. This isn't
necessary, because the caller can change the values in both registers
and the stack before doing the tail-call. The actual limitation is that
the callee can't need more stack space for it's arguments than the
caller does.
This is needed for code using the musttail attribute, as well as
enabling tail calls as an optimisation in more cases.
Before this patch, redundant COPY couldn't be removed for the following
case:
```
$R0 = OP ...
... // Read of %R0
$R1 = COPY killed $R0
```
This patch adds support for tracking the users of the source register
during backward propagation, so that we can remove the redundant COPY in
the above case and optimize it to:
```
$R1 = OP ...
... // Replace all uses of %R0 with $R1
```
Fixes#113154
The encodings used for llvm.trap() on ARM were all marked as barriers
and terminators. This lead to stack frame destroy code being inserted
before the trap if the trap was the last thing in the function and it
had no return statement.
```
void fn() {
volatile int i = 0;
__builtin_trap();
}
```
Produced:
```
fn:
push {r11, lr} << stack frame create
<...>
mov sp, r11
pop {r11, lr} << stack frame destroy
.inst 0xe7ffdefe << trap
bx lr
```
All the other targets don't mark them this way, instead they mark them
with isTrap. I've changed ARM to do this, which fixes the code
generation:
```
fn:
push {r11, lr} << stack frame create
<...>
.inst 0xe7ffdefe << trap
mov sp, r11
pop {r11, lr} << stack frame destroy
bx lr
```
I've updated the existing trap test to force the need for a stack frame,
then check that the instruction immediately after the trap is resetting
the stack pointer.
debugtrap was already working but I've added the same checks for it
anyway.
The previous behavior could be harmful in some edge cases, such as
emitting a call to `fma()` in the `fma()` implementation itself.
Do this by just being more accurate in `isFMAFasterThanFMulAndFAdd()`.
This was already done for PowerPC; this commit just extends that to Arm,
z/Arch, and x86. MIPS and SPARC already got it right, but I added tests
for them too, for good measure.
Note: I don't have commit access.
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
ARM uses complex encoding of immediate values using small number of
bits. As a result, some values cannot be represented as immediate
operands, they need to be synthesized in a register. This change
implements legalization of such constants with loading values from
constant pool.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
This DAG combine replaces a floating-point load/store pair which has no
other uses with an integer one, but did not copy the memory operand
flags to the new instructions, resulting in it dropping the volatile
flag. This optimisation is still valid if one or both of the
instructions is volatile, so we can copy over the whole
MachineMemOperand to generate volatile integer loads and stores where
needed.
If SETCC or VSELECT is not legal for vector, we should not expand it,
instead we can split the vectors.
So that, some simple scale instructions can be emitted instead of
some pairs of comparation+selection.
When -mno-movt is passed to Clang, the ARM codegen correctly avoids
movt/movw pairs to take the address of __stack_chk_guard in the stack
protector code emitted into the function pro- and epilogues. However,
the Thumb2 codegen fails to do so, and happily emits movw/movt pairs
unless it is generating an ELF binary and the symbol might be in a
different DSO. Let's incorporate a check for useMovt() in the logic
here, so movt/movw are never emitted when -mno-movt is specified.
Suggestions welcome for how/where to add a test case for this.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
KnownBits::srem does not correctly set the leader zero-bits, omitting
the fact that LHS may be known-negative or known-non-negative. Fix this.
Alive2 proof: https://alive2.llvm.org/ce/z/Ugh-Dq