Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first-class
denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.
The syntax in the common cases looks like this:
`denormal_fpenv(preservesign,preservesign)`
`denormal_fpenv(float: preservesign,preservesign)`
`denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`
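To make the bitfield idea concrete, here is a minimal sketch of how two mode pairs could be packed into one small integer. The enum values, the two-bits-per-field layout, the (output, input) ordering, and all names are assumptions for illustration only; they are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mode values; the real encoding in LLVM may differ.
enum class Mode : uint8_t { IEEE = 0, PreserveSign = 1, PositiveZero = 2, Dynamic = 3 };

// Hypothetical packed form: two bits per component, four components.
struct DenormalFPEnv {
  uint8_t Bits = 0;

  static DenormalFPEnv get(Mode DefOut, Mode DefIn, Mode F32Out, Mode F32In) {
    DenormalFPEnv E;
    E.Bits = static_cast<uint8_t>(encode(DefOut) | encode(DefIn) << 2 |
                                  encode(F32Out) << 4 | encode(F32In) << 6);
    return E;
  }

  // Each query is a shift and a mask; no string parsing is involved.
  Mode defaultOutput() const { return field(0); }
  Mode defaultInput() const { return field(2); }
  Mode f32Output() const { return field(4); }
  Mode f32Input() const { return field(6); }

private:
  static unsigned encode(Mode M) { return static_cast<unsigned>(M); }
  Mode field(unsigned Shift) const {
    assert(Shift <= 6 && "only four 2-bit fields exist");
    return static_cast<Mode>((Bits >> Shift) & 3u);
  }
};
```

Under these assumptions, the third syntax example above would correspond to `DenormalFPEnv::get(Mode::Dynamic, Mode::Dynamic, Mode::PreserveSign, Mode::PreserveSign)`.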
I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).
This changes behavior when the command-line debug options are used to
set the denormal mode. Previously the flags skipped functions that
already had an explicit attribute, checked separately for the default
and f32 versions. Now that these are one attribute, the flag logic
can't distinguish which of the two components was explicitly set on
the function. Only one test appeared to rely on this behavior, so I
just avoided using the flags in it.
This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.
I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
When targeting architectures that do not support unaligned memory
accesses, or when -mno-unaligned-access is passed explicitly, the
compiler has to expand each unaligned load/store into an inline sequence.
For 32-bit operations this typically involves:
1. 4× LDRB (or 2× LDRH),
2. multiple shift/or instructions
These sequences are emitted at every unaligned access site, and
therefore contribute significant code size in workloads that touch
packed or misaligned structures.
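As a rough source-level picture of the inline expansion described above, the sketch below assembles a 32-bit value from four byte accesses plus shifts and ORs. It assumes little-endian byte order, and the function names are hypothetical; the actual instruction sequence is chosen by the backend.

```cpp
#include <cassert>
#include <cstdint>

// Byte-wise 32-bit load: four byte loads combined with shifts and ORs,
// which is roughly the work every unaligned access site has to repeat.
inline uint32_t loadUnaligned32(const unsigned char *P) {
  assert(P != nullptr);
  return static_cast<uint32_t>(P[0]) |
         static_cast<uint32_t>(P[1]) << 8 |
         static_cast<uint32_t>(P[2]) << 16 |
         static_cast<uint32_t>(P[3]) << 24;
}

// Byte-wise 32-bit store: four byte stores of the shifted value.
inline void storeUnaligned32(unsigned char *P, uint32_t V) {
  assert(P != nullptr);
  P[0] = static_cast<unsigned char>(V);
  P[1] = static_cast<unsigned char>(V >> 8);
  P[2] = static_cast<unsigned char>(V >> 16);
  P[3] = static_cast<unsigned char>(V >> 24);
}
```

Both functions work at any byte offset, for example `loadUnaligned32(Buf + 1)`.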
When compiling with -Oz in combination with -mno-unaligned-access,
this patch lowers unaligned 32-bit and 64-bit loads and stores to the
following AEABI helper calls:
```
__aeabi_uread4
__aeabi_uread8
__aeabi_uwrite4
__aeabi_uwrite8
```
These helpers provide a way to perform unaligned memory accesses on targets
that do not support them, such as ARMv6-M, or when compiling with
-mno-unaligned-access. Although each use introduces a function call,
making it less straightforward than raw loads and stores, the call
itself is often much smaller than the compiler-emitted sequence of
multiple ldrb/strb operations. As a result, these helpers can greatly
reduce code size, provided they are invoked more than once across a
program.
1. Functions become smaller in AEABI mode once they contain more than a
few unaligned accesses.
2. The total image .text size becomes smaller whenever multiple
functions call the same helpers.
This PR is derived from https://reviews.llvm.org/D57595, with some minor
changes.
Co-authored-by: David Green
This is a followup to https://github.com/llvm/llvm-project/pull/171288,
which removed lowering of libcalls to SDAG nodes for most libcalls that
get unconditionally canonicalized to intrinsics. This handles the
remaining fabs case, which I originally skipped due to larger test
impact.
`EstimateFunctionSizeInBytes`, in `ARMFrameLowering.cpp`, provides an
early estimate of the compiled size of a function, in a context that
wants to overestimate rather than underestimate.
In some cases it was underestimating severely, by over 20%. The
discrepancy was entirely accounted for by the fact that `COPY`
operations were not being counted at all, even though each one (or at
least each one that survives any post-regalloc optimizations) takes 2
bytes in Thumb or 4 in Arm. This could lead to a compile failure, if the
underestimated function size led frame lowering to not stack LR, but
later, `ARMConstantIslandsPass` needed to insert an intra-function
branch long enough to require a `bl` instruction, needing LR to have
been stacked.
The result of `EstimateFunctionSizeInBytes` was not directly available
for testing, so I added an `LLVM_DEBUG` at the end of the function. That
way, the test file doesn't need to try to make a >2048 byte function
estimated at <2048 bytes; it just needs to exhibit a function with a
single `COPY` and make sure it's counted.
At the moment, `EstimateFunctionSizeInBytes` is only used at all in
Thumb-1 compilations, to decide whether the function is large enough to
justify stacking LR as a precaution. However, the subroutine
`ARMBaseInstrInfo::getInstSizeInBytes` which counts each individual
`MachineInstr` is called from other contexts too, so I've made it return
a sensible answer for `COPY` nodes in both of Arm and Thumb.
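The effect of counting `COPY` can be shown with a toy size estimator. This is only a sketch: the `Opcode` and `Inst` types and both function names are hypothetical stand-ins, not the LLVM classes; the 2-byte and 4-byte figures are the ones quoted above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for MachineInstr and its opcode.
enum class Opcode { Copy, Other };
struct Inst {
  Opcode Op;
  unsigned EncodedSize; // known size for non-COPY instructions
};

// A COPY has no fixed encoding yet, so assume the size of the move it
// will most likely become: 2 bytes in Thumb, 4 bytes in Arm.
unsigned instSizeInBytes(const Inst &I, bool IsThumb) {
  if (I.Op == Opcode::Copy)
    return IsThumb ? 2 : 4;
  return I.EncodedSize;
}

// Sum the per-instruction sizes; skipping COPYs here would underestimate.
unsigned estimateFunctionSize(const std::vector<Inst> &Insts, bool IsThumb) {
  unsigned Size = 0;
  for (const Inst &I : Insts)
    Size += instSizeInBytes(I, IsThumb);
  return Size;
}
```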
This restriction was originally added in
https://reviews.llvm.org/D143256, with the given justification:
> Currently, in TargetLowering, if the target does not support fminnum,
we lower to fminimum if neither operand could be a NaN. But this isn't
quite correct because fminnum and fminimum treat +/-0 differently; so,
we need to prove that one of the operands isn't a zero.
As far as I can tell, this was never correct. Before
https://github.com/llvm/llvm-project/pull/172012, `minnum` and `maxnum`
were nondeterministic with regards to signed zero, so it's always been
perfectly legal to lower them to operations that order signed zeroes.
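To illustrate the argument, the sketch below is a minimum that orders -0.0 below +0.0. Because the old `minnum` semantics allowed either zero as the result, a result produced this way is always one of the permitted outcomes. The helper name is hypothetical, and NaN operands are assumed to have been ruled out, as in the lowering discussed above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: minimum that treats -0.0 as smaller than +0.0.
// Operands are assumed not to be NaN.
inline double minOrderedZero(double A, double B) {
  assert(!std::isnan(A) && !std::isnan(B));
  if (A == 0.0 && B == 0.0)
    return std::signbit(A) ? A : B; // prefer the negative zero if present
  return A < B ? A : B;
}
```

For mixed zeros the result is -0.0 regardless of operand order; a minimum that did not order signed zeroes could return either one.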
This is needed to support functionality in the AMDGPU scheduler. Various
passes have been modified to preserve MBFI to ensure that this change
does not introduce new invocations of MBFI. Some targets have passes
reordered, but there are no new runs of MBFI.
Compiling OpenSSL for Thumb was giving a crash in `ARMConstantIslands`
with error message: "underestimated function size". Adding a size for
`tLDRLIT_ga_pcrel` pseudo instruction fixes the issue. Also added a
size for `tLDRLIT_ga_abs` as per review comments.
The index on a vsdot and vudot instruction can be 0/1 from a D-reg, not 0/1/2/3
from a Q reg as would be expected. Add a pattern to allow extracting from the
high half of the input vector.
Fixes #174688
Add support for `__builtin_stack_address` builtin. The semantics match
those of GCC's builtin with the same name.
`__builtin_stack_address` returns the starting address of the stack
region that may be used by called functions. It may or may not include
the space used for on-stack arguments passed to a callee (See [GCC
Bug/121013](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121013)).
Fixes #82632.
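A minimal usage sketch follows. It guards the call with `__has_builtin` so it still compiles with compilers that lack the builtin; the fallback to `__builtin_frame_address(0)` is an assumption made only so the example builds everywhere, and it returns a different address with different semantics. The wrapper name is hypothetical.

```cpp
#include <cassert>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical wrapper around the builtin described above.
inline void *currentStackAddress() {
#if __has_builtin(__builtin_stack_address)
  return __builtin_stack_address();
#else
  // Fallback for older compilers only; this is the frame address,
  // not the stack address, so the value is not equivalent.
  return __builtin_frame_address(0);
#endif
}
```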
The pass now contains a non-fp expansion and should
be used for any similar expansions regardless of the
types involved. Hence a generic name seems apt.
Rename the source files, pass, and adjust the pass
description. Move all tests for the expansions
that have previously been merged into the pass
to a single directory.
Both passes expand instructions at the IR level.
They use the same kind of instruction visitation
logic and contain significant code duplication e.g.
for scalarization.
Fixes https://github.com/llvm/llvm-project/issues/98389
As the issue describes, promoting `llvm.fma.f16` to `llvm.fma.f32` does
not work, because there is not enough precision to handle the repeated
rounding. `f64` does have sufficient space. So this PR explicitly
promotes the 16-bit fma to a 64-bit fma.
I could not find examples of a libcall being used for fma, but that's
something that could be looked into separately to work around code size
issues.
PHIs that are larger than a legal integer type are split into multiple
virtual registers that are numbered sequentially. We can propagate the
known bits for each of these registers individually.
Big endian is not supported yet because the register order needs to be
reversed.
Fixes #171671
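The per-register propagation can be pictured with a small model in which the known bits of a 64-bit PHI value are split into two 32-bit parts. The structs and the function are hypothetical, not LLVM's `KnownBits` class, and the part order assumes little endian (part 0 holds the low bits), matching the restriction noted above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical known-bits records: a set bit in Zero (One) means that
// bit of the value is known to be 0 (1).
struct KnownBits32 { uint32_t Zero, One; };
struct KnownBits64 { uint64_t Zero, One; };

// Split the known bits of a wide value across two sequentially numbered
// 32-bit registers. Little endian is assumed: part 0 is the low half.
std::array<KnownBits32, 2> splitKnownBits(const KnownBits64 &K) {
  assert((K.Zero & K.One) == 0 && "a bit cannot be known 0 and known 1");
  std::array<KnownBits32, 2> Parts;
  Parts[0] = {static_cast<uint32_t>(K.Zero), static_cast<uint32_t>(K.One)};
  Parts[1] = {static_cast<uint32_t>(K.Zero >> 32),
              static_cast<uint32_t>(K.One >> 32)};
  return Parts;
}
```

On a big-endian target the two parts would have to be assigned in the opposite order.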
This is a followup to https://github.com/llvm/llvm-project/pull/171114,
removing the handling for most libcalls that are already canonicalized
to intrinsics in the middle-end. The only remaining one is fabs, which
has more test coverage than the others.
SDAG currently tries to lower certain libcalls to ISD opcodes. However,
many of these are already canonicalized from libcalls to intrinsic in
the middle-end (and often already emitted as intrinsics in the
front-end).
I believe that SDAG should not be doing anything for such libcalls. This
PR just drops a single libcall to get consensus on the direction, as
these changes need a non-trivial amount of test updates.
A lot of the remaining libcalls *should* probably also be canonicalized
to intrinsics in the middle-end when annotated with `memory(none)`, but
that would require additional work in SimplifyLibCalls.
Emitting the symbol in `emitGlobalAlias` seemed most efficient,
otherwise I think you'd have to traverse all aliases. I have verified
that the additional symbol is picked up by `arm-none-eabi-ld` and
correctly generates an entry in `veneers.o`.
Fixes #162084
In the Dhrystone benchmark, I found that some adjacent globals were not
merged, whereas GCC's anchor optimization handles them. Using
global-merge-max-offset to set the maximum offset yields similar results
(still slightly different, but at least we can control the offset).
When the stars align to conspire against stack alignment, when we have
frame-pointer=non-leaf we can incorrectly skip preserving fp/r7 in the
prolog.
The fix here first makes sure we're using the right frame pointer
register in the context of preserving the incoming FP, and then makes
sure that we save the FP when re-alignment is known to be necessary.
rdar://162462271
For tail-calls we want to re-use the caller stack-frame and potentially
need to copy stack arguments.
For large stack arguments, such as by-val structs, this can lead to
overwriting incoming stack arguments when preparing outgoing ones by
copying them. E.g., in cases like
%"struct.s1" = type { [19 x i32] }
define void @f0(ptr byval(%"struct.s1") %0, ptr %1) {
tail call void @f1(ptr %1, ptr byval(%"struct.s1") %0)
ret void
}
declare void @f1(ptr, ptr)
that swap arguments, the last bytes of %0 are on the stack, followed by
%1. To prepare the outgoing arguments, %0 needs to be copied and %1
needs to be loaded into r0. However, currently the copy of %0
overwrites the location of %1, resulting in loading garbage into r0.
We fix that by forcing the load of the pointer stack argument to happen
before the copy.
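The ordering problem can be reproduced in a toy model of the argument area. Everything here is a simplification for illustration: the 16-slot frame, the slot positions, and the function name are assumptions, not the layout the backend computes. The incoming frame holds the tail of `%0` in slots 0 to 14 and `%1` in slot 15; the outgoing byval copy needs one more word and therefore covers slot 15.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model of the shared stack argument area (hypothetical layout).
struct Frame { uint32_t Slots[16]; };

// Prepares the outgoing arguments in place and returns the value that
// would end up in r0 (the pointer %1).
uint32_t swapArgs(Frame &F, uint32_t WordFromReg, bool LoadBeforeCopy) {
  uint32_t R0 = 0;
  if (LoadBeforeCopy)
    R0 = F.Slots[15]; // read %1 while its slot is still intact

  // Outgoing byval copy: shifted up by one word, clobbering slot 15.
  std::memmove(&F.Slots[1], &F.Slots[0], 15 * sizeof(uint32_t));
  F.Slots[0] = WordFromReg;

  if (!LoadBeforeCopy)
    R0 = F.Slots[15]; // too late: this now holds a word of %0
  return R0;
}
```

With the load first, r0 receives the original pointer; with the copy first, it receives a stale word of the struct.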
The subtarget may not be set if no functions are present in the module.
Attempt to use the TargetMachine directly in more cases.
Fixes #165422
Fixes #167577
sincospi/sincospif/sincospil do not appear to exist on common
targets. Darwin targets have __sincospi and __sincospif, so define
and use those implementations. I have no idea what version added
those calls, so I'm just guessing it's the same conditions as
__sincos_stret.
Most of this patch is working to preserve codegen when a vector
library is explicitly enabled. This only covers sleef and armpl,
as those are the only cases tested.
The multiple result libcalls have an aberrant process where the
legalizer looks for the scalar type's libcall in RuntimeLibcalls,
and then cross references TargetLibraryInfo to find a matching
vector call. This was unworkable in the sincospi case, since the
common case is there is no scalar call available. To preserve
codegen if the call is available, first try to match a libcall
with the vector type before falling back on the old scalar search.
Eventually all of this logic should be contained in RuntimeLibcalls,
without the link to TargetLibraryInfo. In principle we should perform
the same legalization logic as for an ordinary operation, trying
to find a matching subvector type with a libcall.
In the call graph section, we were emitting the temporary label
pointing to the start of the function instead of the canonical,
linkage-correct function symbol. This patch fixes that and updates the
corresponding tests.
This patch does the same changes as D143001 for AArch64.
This PR is part of the work on adding strict FP support in ARM, which
was previously discussed in #137101.
This consists of marking the various strict opcodes as legal, and
adjusting instruction selection patterns so that 'op' is 'any_op'. The
changes are similar to those in D114946 for AArch64.
Custom lowering and promotion are set for some FP16 strict ops to work
correctly.
This PR is part of the work on adding strict FP support in ARM, which
was previously discussed in #137101.
I'm not sure if this is the best way forward or not, but we keep
running into issues from forgetting that shuffle_vectors can be scalar.
(There is another example in the recently added known-bits code.) As a
scalar-dst shuffle vector is just an extract, and a scalar-source
shuffle vector is just a build vector, this patch makes scalar shuffle
vector illegal and adjusts the irbuilder to create the correct node as
required.
Most targets do this already through lowering or combines. Making scalar
shuffles illegal simplifies gisel as a whole; it just requires that
transforms that create shuffles of new sizes account for the scalar
shuffle being illegal (mostly IRBuilder and LessElements).
Implement KCFI (Kernel Control Flow Integrity) backend support for
ARM32, Thumb2, and Thumb1. The Linux kernel has supported ARM KCFI via
Clang's generic KCFI implementation, but this has finally started to
[cause problems](https://github.com/ClangBuiltLinux/linux/issues/2124)
so it's time to get the KCFI operand bundle lowering working on ARM.
Supports patchable-function-prefix with adjusted load offsets. Provides
a worst-case instruction size estimate of how large the KCFI bundle is
so that range-limited instructions (e.g. cbz) know how big the indirect
calls can become.
ARM implementation notes:
- Four-instruction EOR sequence builds the 32-bit type ID byte-by-byte
to work within ARM's modified immediate encoding constraints.
- Scratch register selection: r12 (IP) is preferred, r3 used as fallback
when r12 holds the call target. r3 gets spilled/reloaded if it is
being used as a call argument.
- UDF trap encoding: 0x8000 | (0x1F << 5) | target_reg_index, similar
to aarch64's trap encoding.
Thumb2 implementation notes:
- Logically the same as ARM
- UDF trap encoding: 0x80 | target_reg_index
Thumb1 implementation notes:
- Due to register pressure, 2 scratch registers are needed: r3 and r2,
which get spilled/reloaded if they are being used as call args.
- Instead of EOR, add/lsl sequence to load immediate, followed by
a compare.
- No trap encoding.
Update tests to validate all three subtargets.
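The immediates mentioned in the notes above can be computed as in the sketch below. The two trap formulas are copied from the notes; the function names are hypothetical, and the `matchesTypeId` check (XOR-ing the loaded value with each piece and testing for zero) is an assumption about how a four-instruction EOR sequence could be used, not a description of the emitted code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// UDF immediates as given in the implementation notes.
inline uint32_t armTrapImm(unsigned TargetRegIndex) {
  assert(TargetRegIndex < 16);
  return 0x8000u | (0x1Fu << 5) | TargetRegIndex;
}
inline uint32_t thumb2TrapImm(unsigned TargetRegIndex) {
  assert(TargetRegIndex < 16);
  return 0x80u | TargetRegIndex;
}

// Split a 32-bit type ID into four pieces, one byte each. Every piece is
// an 8-bit value at an even bit offset, so it fits ARM's modified
// immediate encoding.
inline std::array<uint32_t, 4> typeIdPieces(uint32_t Id) {
  return {Id & 0xFFu, Id & 0xFF00u, Id & 0xFF0000u, Id & 0xFF000000u};
}

// Assumed check: XOR all four pieces into the loaded value; the result
// is zero exactly when the loaded value equals the expected ID.
inline bool matchesTypeId(uint32_t Loaded, uint32_t ExpectedId) {
  uint32_t X = Loaded;
  for (uint32_t Piece : typeIdPieces(ExpectedId))
    X ^= Piece;
  return X == 0;
}
```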
This patch ports the ISD::SUB handling from SelectionDAG’s ComputeNumSignBits to GlobalISel.
Related to https://github.com/llvm/llvm-project/issues/150515.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
Based on top of #157211.
`FNEG` and `FABS` must preserve signalling NaNs, meaning they should not
convert to f32 to perform the operation. Instead legalize to `XOR` and
`AND`.
Fixes almost all of #104915
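The bitwise legalization can be sketched on raw bit patterns. The sketch assumes a 16-bit IEEE half layout (sign in bit 15, quiet bit in bit 9); the type involved and the function names are assumptions for illustration. Since only the sign bit is touched, a signalling NaN payload passes through unchanged, which a round trip through a float conversion would not guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Sign-bit operations on a 16-bit float bit pattern (assumed IEEE half).
inline uint16_t fnegBits(uint16_t H) {
  return static_cast<uint16_t>(H ^ 0x8000u); // XOR flips the sign bit
}
inline uint16_t fabsBits(uint16_t H) {
  return static_cast<uint16_t>(H & 0x7FFFu); // AND clears the sign bit
}

// Signalling NaN: exponent all ones, quiet bit clear, non-zero mantissa.
inline bool isSignalingNaN(uint16_t H) {
  return (H & 0x7C00u) == 0x7C00u && (H & 0x0200u) == 0 && (H & 0x01FFu) != 0;
}
```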