If we are lowering a frem and the divisor is known to be an integer power-2, we
can use the formula 'frem = x - trunc(x / d) * d'. This avoids the more
expensive call to fmod. The results are identical as fmod so long as d is a
power-2 (so the mul does not round incorrectly), and the sign of the return is
either always positive or not important for zeroes (nsz).
Unfortunately Alive2 does not handle this well at the moment. I was using
exhaustive checking to test this:
(https://gist.github.com/davemgreen/6078015f30d3bacd1e9572f8db5d4b64).
I found this in cpythons implementation of float_pow. I currently added it as a
DAG combine for frem with power-2 fp constants.
This was [addressed for AArch64
here](https://github.com/llvm/llvm-project/pull/79004), but the same
applies to ARM.
Move the enablement of neon+fp64 to `-mcpu=cortex-r52`, which optionally
supports these features.
When compiling for thumbv8.1m with +pacbti and making an indirect tail
call, the compiler was free to put the function pointer into R12.
This is incorrect because R12 is restored to contain authentication code
for the caller's return address.
This patch excludes R12 from the set of registers the compiler can put
the function pointer in.
Fixes https://github.com/llvm/llvm-project/issues/75998
According to langref, llvm.maximum/minimum has -0.0 < +0.0 semantics and
propagates NaN.
Expand the nodes on targets not supporting the operation, by adding
extra check for NaN and using is_fpclass to check zero signs.
As we canonicalizes smin with non-negative operands into umin in the
middle-end, the saturation pattern will be broken.
This patch reverts the transform in DAGCombine to fix the regression on
ARM.
Fixes https://github.com/llvm/llvm-project/issues/85706.
Following https://github.com/llvm/llvm-project/pull/68313 this patch
extends the idea to M-profile PACBTI.
The Machine Scheduler can reorder instructions within a scheduling
region depending on the scheduling policy set. If a BTI-clearing
instruction happens to partake in one such region, it might be moved
around, therefore ending up where it shouldn't.
The solution is to mark all BTI-clearing instructions as scheduling
region boundaries. This essentially means that they must not be part of
any scheduling region, and as consequence never get moved:
- PAC
- PACBTI
- BTI
- SG
Note that PAC isn't BTI-clearing, but it's replaced by PACBTI late in
the compilation pipeline.
As far as I know, currently it isn't possible to organically obtain code
that's susceptible to the bug:
- Instructions that write to SP are region boundaries. PAC seems to
always be followed by the pushing of r12 to the stack, so essentially
PAC is always by itself in a scheduling region.
- CALL_BTI is expanded into a machine instruction bundle. Bundles are
unpacked only after the last machine scheduler run. Thus setjmp and BTI
can be separated only if someone deliberately run the scheduler once
more.
- The BTI insertion pass is run late in the pipeline, only after the
last machine scheduling has run. So once again it can be reordered only
if someone deliberately runs the scheduler again.
Nevertheless, one can reasonably argue that we should prevent the bug in
spite of the compiler not being able to produce the required conditions
for it. If things change, the compiler will be robust against this
issue.
The tests written for this are contrived: bogus MIR instructions have
been added adjacent to the BTI-clearing instructions in order to have
them inside non-trivial scheduling regions.
The load narrowing part of TargetLowering::SimplifySetCC is updated
according to this:
1) The offset calculation (for big endian) did not work properly for
non byte-sized types. This is basically solved by an early exit
if the memory type isn't byte-sized. But the code is also corrected
to use the store size when calculating the offset.
2) To still allow some optimizations for non-byte-sized types the
TargetLowering::isPaddedAtMostSignificantBitsWhenStored hook is
added. By default it assumes that scalar integer types are padded
starting at the most significant bits, if the type needs padding
when being stored to memory.
3) Allow optimizing when isPaddedAtMostSignificantBitsWhenStored is
true, as that hook makes it possible for TargetLowering to know
how the non byte-sized value is aligned in memory.
4) Update the algorithm to always search for a narrowed load with
a power-of-2 byte-sized type. In the past the algorithm started
with the the width of the original load, and then divided it by
two for each iteration. But for a type such as i48 that would
just end up trying to narrow the load into a i24 or i12 load,
and then we would fail sooner or later due to not finding a
newVT that fulfilled newVT.isRound().
With this new approach we can narrow the i48 load into either
an i8, i16 or i32 load. By checking if such a load is allowed
(e.g. alignment wise) for any "multiple of 8 offset", then we can find
more opportunities for the optimization to trigger. So even for a
byte-sized type such as i32 we may now end up narrowing the load
into loading the 16 bits starting at offset 8 (if that is allowed
by the target). The old algorithm did not even consider that case.
5) Also start using getObjectPtrOffset instead of getMemBasePlusOffset
when creating the new ptr. This way we get "nsw" on the add.
These test cases show some miscomplies for big-endian when dealing
with non byte-sized loads. One part of the problem is that LLVM IR
isn't really telling where the padding goes for non byte-sized
loads/stores. So currently TargetLowering::SimplifySetCC can't assume
anything about it. But the implementation also do not consider that
the TypeStoreSize could be larger than the TypeSize, resulting in
the offset calculation being wrong for big-endian.
Pre-commit for https://github.com/llvm/llvm-project/pull/87646
D33412/D33413 introduced this to support a clang pragma to set section
names for a symbol depending on if it would be placed in
bss/data/rodata/text, which may not be known until the backend. However,
for text we know that only functions will go there, so just directly set
the section in clang instead of going through a completely separate
attribute.
Autoupgrade the "implicit-section-name" attribute to directly setting
the section on a Fuction.
Consider the following:
ldr r0, [r4]
ldr r7, [r0, #4]
cmp r7, r3
bhi .LBB0_6
cmp r0, r2
push {r0}
pop {r4}
bne .LBB0_3
movs r0, r6
pop {r4, r5, r6, r7}
pop {r1}
bx r1
Here is a snippet of the generated THUMB1 code of the K&R malloc
function that clang currently compiles to.
push {r0} ends up being popped to pop {r4}.
movs r4, r0 would destroy the flags set by cmp right above.
The compiler has no alternative in this case, except one:
the only alternative is to transfer through a high register.
However, it seems like LLVM does not consider that this is a valid
approach, even though it is a free clobbering a high register.
This patch addresses the FIXME so the compiler can do that when it can
in r10 or r11, or r12.
Following https://github.com/llvm/llvm-project/pull/68313 this patch
extends the idea to M-profile PACBTI.
The Machine Scheduler can reorder instructions within a scheduling
region depending on the scheduling policy set. If a BTI-clearing
instruction happens to partake in one such region, it might be moved
around, therefore ending up where it shouldn't.
The solution is to mark all BTI-clearing instructions as scheduling
region boundaries. This essentially means that they must not be part of
any scheduling region, and as consequence never get moved:
- PAC
- PACBTI
- BTI
- SG
Note that PAC isn't BTI-clearing, but it's replaced by PACBTI late in
the compilation pipeline.
As far as I know, currently it isn't possible to organically obtain code
that's susceptible to the bug:
- Instructions that write to SP are region boundaries. PAC seems to
always be followed by the pushing of r12 to the stack, so essentially
PAC is always by itself in a scheduling region.
- CALL_BTI is expanded into a machine instruction bundle. Bundles are
unpacked only after the last machine scheduler run. Thus setjmp and BTI
can be separated only if someone deliberately run the scheduler once
more.
- The BTI insertion pass is run late in the pipeline, only after the
last machine scheduling has run. So once again it can be reordered only
if someone deliberately runs the scheduler again.
Nevertheless, one can reasonably argue that we should prevent the bug in
spite of the compiler not being able to produce the required conditions
for it. If things change, the compiler will be robust against this
issue.
The tests written for this are contrived: bogus MIR instructions have
been added adjacent to the BTI-clearing instructions in order to have
them inside non-trivial scheduling regions.
Re-land 634b0243b8f7acc85af4f16b70e91d86ded4dc83.
T1 allow for an optional registers list,
the register list must be {d0-d15}.
T2 define a mandatory register list,
the register list must be {d0-d31}.
The requirements for T1/T2 are as follows:
T1 T2
Require: v8-M.Main, v8.1-M.Main,
secure state secure state
16 D Regs valid valid
32 D Regs UNDEFINED valid
No D Regs NOP NOP
T1 allows for an optional registers list, the register list must be {d0-d15}.
T2 defines a mandatory register list, the register list must be {d0-d31}.
The requirements for T1/T2 are as follows:
T1 T2
Require: v8-M.Main, v8.1-M.Main,
secure state secure state
16 D Regs valid valid
32 D Regs UNDEFINED valid
No D Regs NOP NOP
PR #75527 fixed ARMFrameLowering to set the IsRestored flag for LR based
on all of the return instructions in the function, not just one.
However, there is also code in ARMLoadStoreOptimizer which changes
return instructions, but it set IsRestored based on the one instruction
it changed, not the whole function.
The fix is to factor out the code added in #75527, and also call it from
ARMLoadStoreOptimizer if it made a change to return instructions.
Fixes#80287.
When using Greedy Register Allocation, there are times where
early-clobber values are ignored, and assigned the same register. This
is illeagal behaviour for these intructions. To get around this, using
Pseudo instructions for early-clobber registers gives them a definition
and allows Greedy to assign them to a different register. This then
meets the ARM Architecture Reference Manual and matches the defined
behaviour.
This patch takes the existing RISC-V patch and makes it target
independent, then adds support for the ARM Architecture. Doing this will
ensure early-clobber restraints are followed when using the ARM
Architecture. Making the pass target independent will also open up
possibility that support other architectures can be added in the future.
Global Instruction Selector could not select the code:
%0:gprb(s32) = G_CONSTANT i32 -1
In DAG selector the similar code is selected to the instruction MVNi
using custom operand `mod_imm_not`. Changing its definition from
`PatLeaf` to `ImmLeaf` and providing counterpart for `imm_not_XFORM`
make the relevant rule available for GlobalISel too.
If we have a shifted mask, we may be able to reduce the load width
to the width of the non-zero part of the mask and use an offset
to the base address to remove the srl. The offset is given by
C+trailingzeros(ShiftedMask).
Then we add a final shl to restore the trailing zero bits.
I've use the ARM test because that's where the existing (and (srl
(load))) tests were.
The X86 test was modified to keep the H register.
Implement handling of get/set floating point environment for ARM in
Global Instruction Selector. Lowering of these intrinsics to operations
on FPSCR was previously inplemented in DAG selector, in GlobalISel it is
reused.
Extra space causes the checks generated by update_mir_test_checks to be
unavailable.
```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
; CHECK-LABEL: name: foo
; CHECK: bb.0:
; CHECK-NEXT: successors:
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: RET 0, $eax
bb.0:
successors:
bb.1:
RET 0, $eax
...
```
The failure log is as follows:
```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
; CHECK-NEXT: {{ $}}
^
<stdin>:21:13: note: 'next' match was here
successors:
^
<stdin>:21:13: note: previous match ended here
successors:
```
PR #70014 fixes A32 to use GOT for dso_preemptable `__stack_chk_guard`
with static relocation model (e.g. -fPIE/-fPIC LTO compiles with -no-pie
linking).
This patch fixes such `__stack_chk_guard` access for Thumb1 and Thumb2.
Note: `t2LDRLIT_ga_pcrel` is only for ELF. mingw needs
`.refptr.__stack_chk_guard` (https://reviews.llvm.org/D92738).
Fix#64999
Relying on ComputeKnownBits to find a splat is causing miscompilations where a shift of zero is being assumed to give zero, but further simplification leads to a shift of zero by undef, resulting in an unexpected undef value.
Fixes#78109
This avoids possible undefined behavior using the same register for Rm
and Rda.
Additionally adds a check in MC to produce an error upon parsing this
case.
Port CodeGenPrepare to new pass manager and dependency
BasicBlockSectionsProfileReader
Fixes: #75380
Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
The change is fairly mechanical:
1. Factor code from `FastISel::selectIntrinsicCall`, which converts
debug intrinsics into debug instructions, into functions (NFC).
2. Call those functions for DPValues attached to instructions too.
The test updates look the same as other RemoveDIs changes: re-run the
tests with `--try-experimental-debuginfo-iterators`, which checks the
output is identical using the new debug info format (if it has been
enabled in the cmake configuration).
Depends on #76941 (otherwise some modified tests spuriously fail).
Revert e0c554ad87d18dcbfcb9b6485d0da800ae1338d1 "Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#75380)"
Revert #75380 and #77054 as they were breaking EXPENSIVE_CHECKS buildbots: https://lab.llvm.org/buildbot/#/builders/104
Port CodeGenPrepare to new pass manager and dependency
BasicBlockSectionsProfileReader
Fixes: #64560
Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
LLVM intrinsics `get_fpmode`, `set_fpmode` and `reset_fpmode` operate
control modes, the bits of FP environment that affect FP operations. On
ARM these bits are in FPSCR together with the status bits. The
implementation of these intrinsics produces code close to that of
functions `fegetmode` and `fesetmode` from GLIBC.
Pull request: https://github.com/llvm/llvm-project/pull/74054
In some cases, the machine outliner needs to preserve LR across an
outlined call by pushing it onto the stack. Previously, this also
generated unwind table instructions, which is incorrect because EHABI
unwind tables cannot represent different stack frames a different points
in the function, so the extra unwind info applied to the entire
function.
The outliner code already avoided generating CFI instructions, but EHABI
unwind data is generated later from the actual instructions, so we need
to avoid using the FrameSetup and FrameDestroy flags to prevent unwind
data being generated.