`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.
Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
---------
Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
Fix the case where the vector element type of the loaded extractelement
input does not match the result type of the extract.
This fixes a regression reported after
c55a7659b38946350315ac4a18d9805deb1f0a54
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10 year old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subrgs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other pasess like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
I am planning to add some optimization remarks to the
`PartiallyInlineLibCalls` pass. However, since this pass does not emit any
optimization remarks yet, I have to add the "infrastructure" for that first, which
is what this PR is about.
When generating a constant vector, if `UseSplat` is false, the indices
different from the index of the extract can be filled with `poison`
instead of `undef`.
With range and undef metadata on a call we can have vector AssertZExt
generated on a target with no vector operations. The AssertZExt needs to
scalarize to a normal `AssertZext tin, ValueType`. I have added
AssertSext too, although I do not have a test case.
Fixes#110374
This patch make a couple of improvements to ReduceLoadOpStoreWidth.
When determining the minimum size of "NewBW" we now take byte boundaries
into account. If we for example touch bits 6-10 we shouldn't accept
NewBW=8, because we would fail later when detecting that we can't access
bits from two different bytes in memory using a single load. Instead we
make sure to align LSB/MSB according to byte size boundaries up front
before searching for a viable "NewBW".
In the past we only tried to find a "ShAmt" that was a multiple of
"NewBW", but now we use a sliding window technique to scan for a viable
"ShAmt" that is a multiple of the byte size. This can help out finding
more opportunities for optimization (specially if the original type
isn't byte sized, and for big-endian targets when the original
load/store is aligned on the most significant bit).
For pre-ra scheduling, we use two options `-misched-topdown` and
`-misched-bottomup` to force the direction.
While for post-ra scheduling, we use `-misched-postra-direction`
with enumerated values (`topdown`, `bottomup` and `bidirectional`).
This is not unified and adds some mental burdens. Here we replace
these two options `-misched-topdown` and `-misched-bottomup` with
`-misched-prera-direction` with the same enumerated values.
To avoid the condition of `getNumOccurrences() > 0`, we add a new
enum value `Unspecified` and make it the default initial value.
These options are hidden, so we needn't keep the compatibility.
DAGCombiner::ReduceLoadOpStoreWidth could replace memory accesses
with more narrow loads/store, although sometimes the new load/store
would touch memory outside the original object. That seemed wrong
and this patch is simply avoiding doing the DAG combine in such
situations.
Also simplifying the expression used to align ShAmt down to a multiple
of NewBW. Subtracting (ShAmt % NewBW) should do the same thing as the
old more complicated expression.
Intention is to follow up with a patch that make more attempts, trying
to align the memory accesses at other offsets, allowing to trigger
the transform in more situations. The current strategy for deciding
size (NewBW) and offset (ShAmt) for the narrowed operations are a bit
ad-hoc, and not really considering big endian memory order in same
way as little endian.
Adding test cases related to narrowing of load-op-store sequences.
ReduceLoadOpStoreWidth isn't careful enough, so it may end up
creating load/store operations that access memory outside the region
touched by the original load/store. Using ARM as a target for the
test cases to show what happens for both little-endian and big-endian.
This patch also adds a way to override the TLI.isNarrowingProfitable
check in DAGCombiner::ReduceLoadOpStoreWidth by using the option
-combiner-reduce-load-op-store-width-force-narrowing-profitable.
Idea is that it should be simpler to for example add lit tests
verifying that the code is correct for big-endian (which otherwise
is difficult since there are no in-tree big-endian targets that
is overriding TLI.isNarrowingProfitable).
This is a pre-commit for
https://github.com/llvm/llvm-project/pull/119203
Re-landing #116970 after fixing miscompilation error.
The original change made it possible for CMPZ to have multiple uses;
`ARMDAGToDAGISel::SelectCMPZ` was not prepared for this.
Pull Request: https://github.com/llvm/llvm-project/pull/118887
Original commit message:
Following #116547 and #116676, this PR changes the type of results and
operands of some nodes to accept / return a normal type instead of Glue.
Unfortunately, changing the result type of one node requires changing
the operand types of all potential consumer nodes, which in turn
requires changing the result types of all other possible producer nodes.
So this is a bulk change.
This DAG combine was incorrect for big-endian targets, because it
assumes that when a bitcast changes the lane width, the
least-significant bits of the wider lanes are in the lower-numbered
lanes of the smaller type, which is only true for little-endian.
Following #116547 and #116676, this PR changes the type of results and
operands of some nodes to accept / return a normal type instead of Glue.
Unfortunately, changing the result type of one node requires changing
the operand types of all potential consumer nodes, which in turn
requires changing the result types of all other possible producer nodes.
So this is a bulk change.
Pull Request: https://github.com/llvm/llvm-project/pull/116970
After 9fe78db4, the pass inserts `store volatile i32 -1, ptr %call_site`
before all invoke instruction except the one in the entry block, which
has the effect of bypassing landing pads on exceptions.
When configuring the call site for a potentially throwing instruction
check that it is not `InvokeInst` -- they are handled by earlier code.
Following #116547, this changes the result of `ARMISD::CMPFP*` and the
operand of `ARMISD::FMSTAT` from a special `Glue` type to a normal type.
This change allows comparisons to be CSEd and scheduled around as can be
seen in the test changes.
Note that `ARMISD::FMSTAT` is still glued to its consumer nodes; this is
going to be changed in a separate patch.
This patch also sets `CopyCost` of `cl_FPSCR_NZCV` register class to a
negative value. The reason is the same as for CCR register class: it
makes DAG scheduler and InstrEmitter try to avoid copies of `FPCSR_NZCV`
register to / from virtual registers. Previously, this was not
necessary, since no attempt was made to create copies in the first
place.
`TRI::getCrossCopyRegClass` is modified in a way that prevents DAG
scheduler from copying FPSCR into a virtual register. The register
allocator might need to spill the virtual register, but that only seem
to work in Thumb mode.
Following #116547, this changes the result of `ARMISD::CMPFP*` and the
operand of `ARMISD::FMSTAT` from a special `Glue` type to a normal type.
This change allows comparisons to be CSEd and scheduled around as can be
seen in the test changes.
Note that `ARMISD::FMSTAT` is still glued to its consumer nodes; this is
going to be changed in a separate patch.
This patch also sets `CopyCost` of `cl_FPSCR_NZCV` register class to a
negative value. The reason is the same as for CCR register class: it
makes DAG scheduler and InstrEmitter try to avoid copies of `FPCSR_NZCV`
register to / from virtual registers. Previously, this was not
necessary, since no attempt was made to create copies in the first
place.
There might be a case when a copy can't be avoided (although not found
in existing tests). If a copy is necessary, the virtual register will be
created with `cl_FPSCR_NZCV` register class. If this register class is
inappropriate, `TRI::getCrossCopyRegClass` should be modified to return
the correct class.
Pull Request: https://github.com/llvm/llvm-project/pull/116676
1. When two (or more) nodes are glued, DAG scheduler will always
schedule them as one piece, i.e. it will not allow any instructions to
be scheduled between them. It does so because if nodes are glued this
usually means that there is an implicit register dependency between
them, and an intervening node could clobber this physical register. When
emitting such nodes into machine IR, they will also be stuck together,
e.g.:
```
%9:gpr = MOVsrl_glue killed %8, implicit-def $cpsr
%10:gpr = RRX %3, implicit $cpsr
```
2. If a node has Glue result, SelectionDAG will not try to CSE this
node. If it did, it would break the implicit physical register
dependency. In practice this means that if a node with Glue result has
multiple uses, it has to be duplicated before each use. This the reason
for `ARMTargetLowering::duplicateCmp` to exist.
When using normal data dependency, dependent nodes can freely be
scheduled around. If there is a physical register dependency between
nodes, the physical register will be copied to/from a virtual register,
allowing other nodes to intervene between them. The resulting machine IR
might look like this:
```
%9:gpr = LSRs1 killed %8, implicit-def $cpsr
%10:gpr = COPY $cpsr
%11:gpr = ORRrsi killed %9, %3, 242, 14 /* CC::al */, $noreg, $noreg
%12:gpr = BICri killed %11, -2147483648, 14 /* CC::al */, $noreg, $noreg
$cpsr = COPY %10
%13:gpr = RRX %3, implicit $cpsr
```
The two copies are likely to be eliminated by register coalescer, given
that there are no instructions between them that clobber this physical
register. If the copies are unwanted in the first place (they could be
expensive or impossible), DAG scheduler will try to avoid inserting them
wherever possible, and the resulting machine IR will look like this:
```
%9:gpr = LSRs1 killed %8, implicit-def $cpsr
%10:gpr = ORRrsi killed %9, %3, 242, 14 /* CC::al */, $noreg, $noreg
%11:gpr = BICri killed %10, -2147483648, 14 /* CC::al */, $noreg, $noreg
%12:gpr = RRX %3, implicit $cpsr
```
On ARM, arithmetic operations and LSLS already use the new data flow
approach. This patch extends it to include 1-bit shifts.
Pull Request: https://github.com/llvm/llvm-project/pull/116547
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- `Builtins.td` - Add f16 support for libm atan2 builtin
- `CGBuiltin.cpp` - Emit constraint atan2 intrinsic for clang builtin
- `clang/test/CodeGenCXX/builtin-calling-conv.cpp` - Use erff instead of
atan2 for clang builtin to lib call calling convention check, now that
atan2 maps to an intrinsic.
- add atan2 cases to llvm.experimental.constrained tests for more
backends: ARM, PowerPC, RISCV, SystemZ.
- LangRef.rst: add llvm.experimental.constrained.atan2, revise
llvm.atan2 description.
Last part of Implement the atan2 HLSL Function. Fixes#70096.
This reverts commit c31014322c0b5ae596da129cbb844fb2198b4ef4.
Based on the discussions in #112772, this pass is not needed after the
introduction of `llvm.threadlocal.address` intrinsic.
Fixes https://github.com/llvm/llvm-project/issues/112771.
... matching what we have in the disassembler. This isn't turned on by
default since several of the scheduling models are not completely
accurate, and we don't want to be misleading.
If we're masking the LSB of a SRL node result and that is shifting down an extended sign bit, see if we can change the SRL to shift down the MSB directly.
These patterns can occur during legalisation when we've sign extended to a wider type but the SRL is still shifting from the subreg.
Alternative to #114967
Fixes the remaining regression in #112588
When doing a call from CMSE secure state to non-secure state for
v8-M.main, we use the VLLDM and VLSTM instructions to save, clear and
restore the FP registers around the call. These instructions both check
the CONTROL_S.SFPA bit, and if it is clear (meaning the current contents
of the FP registers are not secret) they execute as no-ops.
This causes a problem when CONTROL_S.SFPA==0 before the call, which
happens if there are no floating-point instructions executed between
entry to secure state and the call. If this is the case, then the VLSTM
instruction will do nothing, leaving the save area in the stack
uninitialised. If the called function returns a value in floating-point
registers, the call sequence includes an instruction to copy the return
value from a floating-point register to a GPR, which must be before the
VLLDM instruction. This copy sets CONTROL_S.SFPA, meaning that the VLLDM
will fully execute, and load the uninitialised stack memory into the FP
registers.
This causes two problems:
* The FP register file is clobbered, including all of the callee-saved
registers, which might contain live values.
* The stack region might contain secret values, which will be leaked to
non-secure state through the floating-point registers if/when we
return to non-secure state.
The fix is to insert a `vmov s0, s0` instruction before the VLSTM
instruction, to ensure that CONTROL_S.SFPA is set for both the VLLDM and
VLSTM instruction.
CVE: https://www.cve.org/cverecord?id=CVE-2024-7883
Security bulletin:
https://developer.arm.com/Arm%20Security%20Center/Cortex-M%20Security%20Extensions%20Vulnerability
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
Some tests contain errors in constrained intrinsic usage, such as missed
or extra type parameters, wrong type parameters order and some other.
---------
Co-authored-by: Andy Kaylor <andy_kaylor@yahoo.com>
We don't need to copy byval arguments to tail calls via a temporary, if
we can prove that we are not copying from the outgoing argument area.
This patch does this when the source if the argument is one of:
* Memory in the local stack frame, which can't be used for tail-call
arguments.
* A global variable.
We can also avoid doing the copy completely if the source and
destination are the same memory location, which is the case when the
caller and callee have the same signature, and pass some arguments
through unmodified.
When passing byval arguments to tail-calls, we need to store them into
the stack memory in which this the caller received it's arguments. If
any of the outgoing arguments are forwarded from incoming byval
arguments, then the source of the copy is from the same stack memory.
This can result in the copy corrupting a value which is still to be
read.
The fix is to first make a copy of the outgoing byval arguments in local
stack space, and then copy them to their final location. This fixes the
correctness issue, but results in extra copying, which could be
optimised.
Byval arguments which are passed partially in registers get stored into
the local stack frame, but it is valid to tail-call them because the
part which gets spilled is always re-loaded into registers before doing
the tail-call, so it's OK for the spill area to be deallocated.
The ARM backend was checking that the outgoing values for a tail-call
matched the incoming argument values of the caller. This isn't
necessary, because the caller can change the values in both registers
and the stack before doing the tail-call. The actual limitation is that
the callee can't need more stack space for it's arguments than the
caller does.
This is needed for code using the musttail attribute, as well as
enabling tail calls as an optimisation in more cases.