Cause:
1. An `implicit_def` inside a bundle does not count as a definition of the
register for the machine instruction verifier.
2. As a result, including the `implicit_def` leaves the corresponding register
undefined, which triggers `Bad machine code: Using an undefined physical
register` in the machine instruction verifier.
Fixes https://github.com/llvm/llvm-project/issues/139102
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it became clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
lifetime.start and lifetime.end are primarily intended for use on
allocas, to enable stack coloring and other liveness optimizations. This
is necessary because all (static) allocas are hoisted into the entry
block, so lifetime markers are the only way to convey the actual
lifetimes.
However, lifetime.start and lifetime.end are currently *allowed* to be
used on non-alloca pointers. We don't actually do this in practice, but
the mere fact that it is possible breaks the core purpose of the
lifetime markers, which is stack coloring of allocas. Stack coloring can
only work correctly if all lifetime markers for an alloca are
analyzable.
* If a lifetime marker may operate on multiple allocas via a select/phi,
we don't know which lifetime actually starts/ends and handle it
incorrectly (https://github.com/llvm/llvm-project/issues/104776).
* Stack coloring operates on the assumption that all lifetime markers
are visible, and not, for example, hidden behind a function call or
escaped pointer. It's not possible to change this, as part of the
purpose of lifetime markers is that they work even in the presence of
escaped pointers, where simple use analysis is insufficient.
I don't think there is any way to have coherent semantics for lifetime
markers on allocas, while also permitting them on arbitrary pointer
values.
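To make the select/phi case from the list above concrete, here is a minimal IR
sketch (names and sizes are illustrative): once the marker's operand is a
select of two allocas, stack coloring cannot tell which object's lifetime
actually starts or ends.

```llvm
define void @select_lifetime(i1 %cond) {
  %a = alloca [32 x i8]
  %b = alloca [32 x i8]
  ; The marker no longer names a single alloca, so it is ambiguous whether
  ; %a or %b becomes live here.
  %p = select i1 %cond, ptr %a, ptr %b
  call void @llvm.lifetime.start.p0(i64 32, ptr %p)
  call void @llvm.lifetime.end.p0(i64 32, ptr %p)
  ret void
}

declare void @llvm.lifetime.start.p0(i64, ptr)
declare void @llvm.lifetime.end.p0(i64, ptr)
```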
This PR restricts lifetimes to operate on allocas only. As a followup, I
will also drop the size argument, which is superfluous if we always
operate on an alloca. (This change also renders various code handling
lifetime markers on non-allocas dead. I plan to clean up that kind of
code after dropping the size argument as well.)
In practice, I've only found a few places that currently produce
lifetimes on non-allocas:
* CoroEarly replaces the promise alloca with the result of an intrinsic,
which will later be replaced back with an alloca. I think this is the
only place where there is some legitimate loss of functionality, but I
don't think this is particularly important (I don't think we'd expect
the promise in a coroutine to admit useful lifetime optimization.)
* SafeStack moves unsafe allocas onto a separate frame. We can safely
drop lifetimes here, as SafeStack performs its own stack coloring.
* Similarly for AddressSanitizer: it also moves allocas into separate
memory.
* LSR sometimes replaces the lifetime argument with a GEP chain of the
alloca (where the offsets ultimately cancel out). This is just
unnecessary. (Fixed separately in
https://github.com/llvm/llvm-project/pull/149492.)
* InferAddrSpaces sometimes makes lifetimes operate on an addrspacecast
of an alloca. I don't think this is necessary.
Currently we always produce a cost of 1 for all CostKinds that are not
RecipThroughput, which can underestimate the cost if the type has a
higher legalization cost (like larger vectors). This relaxes it to cover
all cost kinds.
A hardware loop instruction combines a subtract, compare with zero, and
branch. We currently account for the compare and branch being combined
into one in Cost::RateFormula, as part of more general handling for
compare-branch-zero, but don't account for the subtract, leading to
suboptimal decisions in some cases.
Fix this in Cost::RateRegister by noticing when we have such a subtract
and discounting the AddRecCost in such a case.
The insertion point of the COPY isn't always optimal and could eventually
lead to a worse block layout; see the regression test in the first
commit.
This change affects many architectures, but the total number of
instructions in the test cases appears to be slightly lower.
bics is available on ARM.
USAT regressions are to be fixed after this, because that is an issue
in ARMISelLowering and should be a separate PR.
Note that opt optimizes those test cases to min/max intrinsics anyway, so
this should have no real effect on codegen.
Proof: https://alive2.llvm.org/ce/z/kPVQ3_
This patch reassociates `add(add(vecreduce(a), b), add(vecreduce(c),
d))` into `add(vecreduce(add(a, c)), add(b, d))`, to combine the
reductions into a single node. This comes up after unrolling vectorized
loops.
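An IR-level sketch of the transform (the combine itself runs on the
SelectionDAG; the two functions below are intended to be equivalent for
integer adds):

```llvm
define i32 @before(<4 x i32> %a, i32 %b, <4 x i32> %c, i32 %d) {
  ; add(add(vecreduce(a), b), add(vecreduce(c), d)): two reduction nodes.
  %r1 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %a)
  %s1 = add i32 %r1, %b
  %r2 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %c)
  %s2 = add i32 %r2, %d
  %sum = add i32 %s1, %s2
  ret i32 %sum
}

define i32 @after(<4 x i32> %a, i32 %b, <4 x i32> %c, i32 %d) {
  ; add(vecreduce(add(a, c)), add(b, d)): a single reduction node.
  %v = add <4 x i32> %a, %c
  %r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %v)
  %bd = add i32 %b, %d
  %sum = add i32 %r, %bd
  ret i32 %sum
}

declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)
```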
There is another small change to move the reassociateReduction call in the
fadd combine out of the AllowNewConst block, as new constants will not be
created and it should be OK to perform the combine later, after legalization.
This saves 2 instructions in the ARM soft float case for fcmp ueq.
This code is written in a confusing, overly general way. The point
of getCmpLibcallCC is to express that the compiler-rt implementations
of the FP compares are different aliases around functions which may
return -1 in some cases. This does not apply to the call for unordered,
which returns a normal boolean.
Also stop overriding the default value for the unordered compare for ARM.
This was setting it to the same value as the default, which is now assumed.
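For illustration of the distinction described above, the two kinds of helpers
look roughly like this; the names follow the usual libgcc/compiler-rt
soft-float convention and are given only as an example, not as the exact calls
ARM emits:

```llvm
; Ordered compares return a tri-state value that the caller tests against zero
; with a particular condition code (which is what getCmpLibcallCC describes).
declare i32 @__eqsf2(float, float)     ; == 0 iff neither operand is NaN and a == b
declare i32 @__ltsf2(float, float)     ; <  0 iff neither operand is NaN and a <  b
; The unordered helper is a plain boolean, so no special mapping is needed.
declare i32 @__unordsf2(float, float)  ; != 0 iff either operand is NaN
```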
Now that #141786 handles scalar and neon types, this adds MVE
definitions and legalization for llvm.roundeven intrinsics. The existing
llvm.arm.mve.vrintn intrinsics are auto-upgraded to llvm.roundeven, like the
other vrint intrinsics, so they should continue to work.
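For example, a plain roundeven call on an MVE vector type like the one below
should now select to VRINTN rather than being expanded (sketch; the function
name is illustrative):

```llvm
define <4 x float> @roundeven_v4f32(<4 x float> %x) {
  %r = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %x)
  ret <4 x float> %r
}

declare <4 x float> @llvm.roundeven.v4f32(<4 x float>)
```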
When floating-point operations are legalized to operations of a higher
precision (e.g. f16 fadd being legalized to f32 fadd), we end up with a
narrowing and then a widening conversion between consecutive operations. With
the appropriate fast math flags (nnan ninf contract) we can eliminate these
casts.
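A rough IR-level picture of the legalized form and the fold (the actual
combine operates on fp_round/fp_extend nodes in the DAG; this is just a
sketch):

```llvm
; (%a + %b) + %c with f16 fadd promoted to f32:
define half @two_adds(half %a, half %b, half %c) {
  %a.f32 = fpext half %a to float
  %b.f32 = fpext half %b to float
  %add1  = fadd nnan ninf contract float %a.f32, %b.f32
  %t     = fptrunc float %add1 to half   ; narrow ...
  %t.f32 = fpext half %t to float        ; ... then widen again
  %c.f32 = fpext half %c to float
  %add2  = fadd nnan ninf contract float %t.f32, %c.f32
  %r     = fptrunc float %add2 to half
  ret half %r
}
; With nnan ninf contract, the %t / %t.f32 pair can be removed and %add1 fed
; directly into %add2, keeping the intermediate result in f32.
```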
This is a reland of #99752 with the bug fixed (see test diff in the
third commit in this PR).
All `popcount` libcalls return `int`, but `ISD::CTPOP` returns the type
of the argument, which can be wider than `int`. The fix is to make the DAG
legalizer pass the correct return type to `makeLibCall` and sign-extend
the result afterwards.
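In IR terms the corrected lowering amounts to something like the following
(using the usual 64-bit popcount helper as an example):

```llvm
declare i32 @__popcountdi2(i64)   ; all popcount libcalls return 'int'

define i64 @popcount64(i64 %x) {
  ; The call is typed with the i32 return of the libcall, and the result is
  ; then sign-extended back to the original i64 CTPOP type.
  %narrow = call i32 @__popcountdi2(i64 %x)
  %wide   = sext i32 %narrow to i64
  ret i64 %wide
}
```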
Original commit message:
The main change is adding CTPOP to `RuntimeLibcalls.def` to allow
targets to use LibCall action for CTPOP. DAG legalizers are changed
accordingly.
Pull Request: https://github.com/llvm/llvm-project/pull/101786
This adds a call to SimplifyDemandedBits from bitcasts with scalar input
types in SimplifyDemandedVectorElts, which can help simplify the input
scalar.
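An IR analogue of the situation (the change itself is in the DAG combiner):
only one element of the bitcast vector is used, so only part of the scalar is
demanded and the scalar computation can be simplified.

```llvm
define i32 @low_half_only(i64 %x) {
  %y = or i64 %x, -4294967296     ; 0xFFFFFFFF00000000: only touches the high 32 bits
  %v = bitcast i64 %y to <2 x i32>
  ; Only element 0 (the low 32 bits on a little-endian target) is demanded,
  ; so simplifying the demanded bits of the scalar input lets the 'or' be dropped.
  %e = extractelement <2 x i32> %v, i32 0
  ret i32 %e
}
```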
The existing pretty printer generates excessive parentheses for
MCBinaryExpr expressions. This update removes unnecessary parentheses
around MCBinaryExpr uses of the +/- operators and around MCUnaryExpr.
Since relocatable expressions only use + and -, this change improves
readability in most cases.
Examples:
- (SymA - SymB) + C now prints as SymA - SymB + C. This updates the output
  of -fexperimental-relative-c++-abi-vtables for AArch64 and x86 to
  `.long _ZN1B3fooEv@PLT-_ZTV1B-8`.
- expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this
change primarily affecting AMDGPUMCExpr.
Merged two loops that were iterating over the same machine basic block
into one, and made some minor readability improvements (variable
renaming, commenting, and absorbing an if condition into a variable).
This is a quick fix for EXPENSIVE_CHECKS bot failures. I still think we
could defer looking for a compatible subregister further up the use-def
chain, and should be able to check compatibility with the source that is
ultimately found.
AAPCS32 defines the fp16 and bf16 types as being passed as if they were
extended to 32 bits, with the high 16 bits being unspecified. The
extension is specified as happening as-if it was done in a register,
which means that for big endian targets, the actual value gets passed in
the higher-addressed half of the stack slot, instead of the lower-addressed
half as for little endian. Previously, for targets with the
fp16 extension, we were passing these types as a 16-bit stack slot,
which worked for little endian because every later stack slot would be
4-byte aligned, leaving a 2-byte gap, but was incorrect for big endian.
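A sketch of the situation (the argument list is chosen only to force the half
argument onto the stack under the hard-float ABI; names are illustrative):

```llvm
; %h ends up in a 4-byte stack slot, "as if extended to 32 bits in a register".
; On little endian the meaningful 16 bits are at offset 0 of the slot; on big
; endian they are in the higher-addressed half, at offset 2.
define void @caller(half %h) {
  call void @callee(double 0.0, double 0.0, double 0.0, double 0.0,
                    double 0.0, double 0.0, double 0.0, double 0.0,
                    i32 0, i32 0, i32 0, i32 0, half %h)
  ret void
}

declare void @callee(double, double, double, double, double, double, double, double,
                     i32, i32, i32, i32, half)
```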
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10-year-old workaround
for this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously, when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This meant that the
first thing the newly constructed ValueTracker would do
is call getVRegDef to find the instruction we are currently
processing. This added an unnecessary step, placed
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since, in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Instead, move the
processing to start at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this were implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subregs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other passes like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
`RegisterClassInfo::getRegPressureSetLimit` is a wrapper of
`TargetRegisterInfo::getRegPressureSetLimit` with some logic to
adjust the limit by removing reserved registers.
It seems that we shouldn't use `TargetRegisterInfo::getRegPressureSetLimit`
directly, as the comment "This limit must be adjusted dynamically for
reserved registers" says.
Separate from https://github.com/llvm/llvm-project/pull/118787
This helps with +fp64 targets, where f64 is legal and was not previously
lowered. fpextends can be treated as a shift + cvt, and fptrunc can use a libcall.
MinNoSplitDisp was first introduced in D16890 to handle cases where the
ConstantIslands pass fails to converge in the presence of big basic
blocks. However, the computation of the variable seems to be wrong as it
currently computes the offset immediately following UserBB. In other
words, it represents the distance from the beginning of the function to
the end of UserBB. The distance from the beginning of the function does
not seem to be a good indicator of how big the basic block is unless the
basic block is close to the beginning of the function. I think
MinNoSplitDisp should compute the distance between UserOffset and the
end of UserBB instead.
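For example (made-up offsets): if UserBB occupies [0x1000, 0x3000) and
UserOffset is 0x2800, the current code effectively uses 0x3000, the offset of
the end of UserBB from the start of the function, whereas the distance from
UserOffset to the end of UserBB is only 0x800.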
Re-landing #116970 after fixing a miscompilation error.
The original change made it possible for CMPZ to have multiple uses;
`ARMDAGToDAGISel::SelectCMPZ` was not prepared for this.
Pull Request: https://github.com/llvm/llvm-project/pull/118887
Original commit message:
Following #116547 and #116676, this PR changes the type of results and
operands of some nodes to accept / return a normal type instead of Glue.
Unfortunately, changing the result type of one node requires changing
the operand types of all potential consumer nodes, which in turn
requires changing the result types of all other possible producer nodes.
So this is a bulk change.
There were two bugs in the implementation of the MVE vsbciq (subtract
with carry across vector, with initial carry value) intrinsics:
* The VSBCI instruction behaves as if the carry-in is always set, but we
were selecting it when the carry-in is clear.
* The vsbciq intrinsics should generate IR with the carry-in set, but
they were leaving it clear.
These two bugs almost cancelled each other out, but resulted in
incorrect code when the vsbcq intrinsics (with a carry-in) were used,
and the carry-in was a compile time constant.
Following #116547 and #116676, this PR changes the type of results and
operands of some nodes to accept / return a normal type instead of Glue.
Unfortunately, changing the result type of one node requires changing
the operand types of all potential consumer nodes, which in turn
requires changing the result types of all other possible producer nodes.
So this is a bulk change.
Pull Request: https://github.com/llvm/llvm-project/pull/116970
Following #116547, this changes the result of `ARMISD::CMPFP*` and the
operand of `ARMISD::FMSTAT` from a special `Glue` type to a normal type.
This change allows comparisons to be CSEd and scheduled around as can be
seen in the test changes.
Note that `ARMISD::FMSTAT` is still glued to its consumer nodes; this is
going to be changed in a separate patch.
This patch also sets `CopyCost` of `cl_FPSCR_NZCV` register class to a
negative value. The reason is the same as for the CCR register class: it
makes the DAG scheduler and InstrEmitter try to avoid copies of the
`FPSCR_NZCV` register to / from virtual registers. Previously, this was not
necessary, since no attempt was made to create copies in the first
place.
`TRI::getCrossCopyRegClass` is modified in a way that prevents the DAG
scheduler from copying FPSCR into a virtual register. The register
allocator might need to spill the virtual register, but that only seems
to work in Thumb mode.
Following #116547, this changes the result of `ARMISD::CMPFP*` and the
operand of `ARMISD::FMSTAT` from a special `Glue` type to a normal type.
This change allows comparisons to be CSEd and scheduled around as can be
seen in the test changes.
Note that `ARMISD::FMSTAT` is still glued to its consumer nodes; this is
going to be changed in a separate patch.
This patch also sets `CopyCost` of `cl_FPSCR_NZCV` register class to a
negative value. The reason is the same as for the CCR register class: it
makes the DAG scheduler and InstrEmitter try to avoid copies of the
`FPSCR_NZCV` register to / from virtual registers. Previously, this was not
necessary, since no attempt was made to create copies in the first
place.
There might be cases where a copy can't be avoided (although none were found
in existing tests). If a copy is necessary, the virtual register will be
created with `cl_FPSCR_NZCV` register class. If this register class is
inappropriate, `TRI::getCrossCopyRegClass` should be modified to return
the correct class.
Pull Request: https://github.com/llvm/llvm-project/pull/116676
For most MVE predicated FMA instructions, disabled lanes will contain
the value in the addend operand. However, the VFMAS instruction takes
the addend in a GPR, and the output register is shared with the first
multiply operand, so disabled lanes will get that value instead. This
means that we can't use the same intrinsic as for the other VFMA
instructions. Instead, we can codegen the vfmas intrinsic in clang to a
regular FMA plus select, which the backend already has patterns to
select VFMAS from.
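A hedged sketch of the kind of IR this implies for the predicated form
(intrinsic use and names are illustrative, not the exact builtin lowering):
the important part is that inactive lanes take the first multiply operand %a
rather than the addend.

```llvm
define <4 x float> @vfmas_m_sketch(<4 x float> %a, <4 x float> %b, float %c, <4 x i1> %p) {
  ; Splat the GPR addend, form an ordinary FMA, then select against %a so that
  ; disabled lanes keep the value of the first multiply operand, as VFMAS does.
  %c.ins   = insertelement <4 x float> poison, float %c, i32 0
  %c.splat = shufflevector <4 x float> %c.ins, <4 x float> poison, <4 x i32> zeroinitializer
  %fma     = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b, <4 x float> %c.splat)
  %res     = select <4 x i1> %p, <4 x float> %fma, <4 x float> %a
  ret <4 x float> %res
}

declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)
```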
The MVE VADC and VSBC instructions read and write a carry bit in FPSCR,
which is exposed through the intrinsics. This makes it possible to write
code which has the FPSCR live across a function call, or which uses the
same value twice, so it needs to be possible to spill and reload it.
There is a missed optimisation in one of the test cases, where we reload
the FPSCR from the stack despite it still being live; I've not found a
simple way to prevent the register allocator from doing this.