This reverts commit 58cc1675ec7b4aa5bc2dab56180cb7af1b23ade5.
I also made the incorrect assumption that we know both values are
+/-0.0 here as well. Revert for now.
When ordering signed zero, only check the sign of one of the values. We
already know at this point that both values must be +/-0.0, so it is
sufficient to check one of them to correctly order them.
For example, for fmaximum, if we know LHS is `+0.0` then we can always
select LHS, value of RHS does not matter. If LHS is `-0.0` we can always
select RHS, value of RHS doesn't matter.
FMAXIMUM is currently legalized via IS_FPCLASS for the signed zero
handling. This is problematic, because it assumes the equivalent integer
type is legal. Many targets have legal fp128, but illegal i128, so this
results in legalization failures.
Fix this by replacing IS_FPCLASS with checking the bitcast to integer
instead. In that case it is sufficient to use any legal integer type, as
we're just interested in the sign bit. This can be obtained via a stack
temporary cast. There is existing FloatSignAsInt functionality used for
legalization of FABS and similar we can use for this purpose.
Fixes https://github.com/llvm/llvm-project/issues/139380.
Fixes https://github.com/llvm/llvm-project/issues/139381.
Fixes https://github.com/llvm/llvm-project/issues/140445.
The .syntax unified directive and .codeX/.code X directives are, other
than some simple common printing code, exclusively implemented in the
targets themselves. Thus, remove the corresponding MCAF_* flags and
reimplement the directives solely within the targets. This avoids
exposing all targets to all other targets' flags.
Since MCAF_SubsectionsViaSymbols is all that remains, convert it to its
own function like other directives, simplifying its implementation.
Note that, on X86, we now always need a target streamer when parsing
assembly, as it's now used for directives that aren't COFF-specific. It
still does not however need to do anything when producing a non-COFF
object file, so this commit does not introduce any new target streamers.
There is some churn in test output, and corresponding UTC regex changes,
due to comments no longer being flushed by these various directives (and
EmitEOL is not exposed outside MCAsmStreamer.cpp so we couldn't do so
even if we wanted to), but that was a bit odd to be doing anyway.
This is motivated by Morello LLVM, which adds yet another assembler flag
to distinguish A64 and C64 instruction sets, but did not update every
switch and so emits warnings during the build. Rather than fix those
warnings it seems better to instead make the problem not exist in the
first place via this change.
It is used to mark a value that we are sure that it is not some fcType.
The examples include:
* An arguments of a function is marked with nofpclass
* Output value of an intrinsic can be sure to not be some type
So that the following operation can make some assumptions.
This reverts commit 9d90f8ba7113fd9c7b2662682ad94b744ed2b78c.
Test fails the machine verifier. There's a bug somewhere, the
unrepresentable cases should be avoided by the default logic.
All of the overrides of shouldRewriteCopySrc appear to be hacks
for bugs in the base implementation, so I'm trying to delete
all of the overrides. I was expecting this to find an example
issue like the x86 version, but no tests change with this.
The code below the removed check looks generic enough to support
arbitrary integer widths. This change helps 32-bit targets avoid
expensive expansion/libcalls in the case of zero input.
Pull Request: https://github.com/llvm/llvm-project/pull/137197
When floating-point operations are legalized to operations of a higher
precision (e.g. f16 fadd being legalized to f32 fadd) then we get
narrowing then widening operations between each operation. With the
appropriate fast math flags (nnan ninf contract) we can eliminate these
casts.
This is a reland of #99752 with the bug fixed (see test diff in the
third commit in this PR).
All `popcount` libcalls return `int`, but `ISD::CTPOP` returns the type
of the argument, which can be wider than `int`. The fix is to make DAG
legalizer pass the correct return type to `makeLibCall` and sign-extend
the result afterwards.
Original commit message:
The main change is adding CTPOP to `RuntimeLibcalls.def` to allow
targets to use LibCall action for CTPOP. DAG legalizers are changed
accordingly.
Pull Request: https://github.com/llvm/llvm-project/pull/101786
FPSCR and FPEXC will be stored in FPStatusRegs, after GPRCS2 has been
saved.
- GPRCS1
- GPRCS2
- FPStatusRegs (new)
- DPRCS
- GPRCS3
- DPRCS2
FPSCR is present on all targets with a VFP, but the FPEXC register is
not present on Cortex-M devices, so different amounts of bytes are
being pushed onto the stack depending on our target, which would
affect alignment for subsequent saves.
DPRCS1 will sum up all previous bytes that were saved, and will emit
extra instructions to ensure that its alignment is correct. My
assumption is that if DPRCS1 is able to correct its alignment to be
correct, then all subsequent saves will also have correct alignment.
Avoid annotating the saving of FPSCR and FPEXC for functions marked
with the interrupt_save_fp attribute, even though this is done as part
of frame setup. Since these are status registers, there really is no
viable way of annotating this. Since these aren't GPRs or DPRs, they
can't be used with .save or .vsave directives. Instead, just record
that the intermediate registers r4 and r5 are saved to the stack
again.
Co-authored-by: Jake Vossen <jake@vossen.dev>
Co-authored-by: Alan Phipps <a-phipps@ti.com>
https://reviews.llvm.org/D17938 introduced lowerRelativeReference to
give ConstantExpr sub (A-B) special semantics in ELF: when `A` is an
`unnamed_addr` function, create a PLT-generating relocation. This was
intended for C++ relative vtables, but C++ relative vtable ended up
using DSOLocalEquivalent (lowerDSOLocalEquivalent).
This special treatment of `unnamed_addr` seems unusual.
Let's remove it. Only COFF needs an overload to generate a @IMGREL32
relocation specifier (llvm/test/MC/COFF/cross-section-relative.ll).
Pull Request: https://github.com/llvm/llvm-project/pull/134781
Fixes#102053
The check was added in 8decdc472f308b13d7fb7fd50c3919db086c0417, and at
the time iOS 5 was the latest iOS version, before that commit tail calls
were disabled for all ARMv7 targets. Testing a build of wasm3 with the
patch on a device running iOS 3.0 shows a noticeable performance
improvement and no issues.
Almost all of the rotate idioms that are valid for an 'or' are also
valid when the halves are combined with an 'add'. Further, many of these
cases are not handled by common bits tracking meaning that the 'add' is
not converted to a 'disjoint or'.
https://reviews.llvm.org/D17938 introduced lowerRelativeReference to
give ConstantExpr sub (A-B) special semantics in ELF: when `A` is an
`unnamed_addr` function, create a PLT-generating relocation. This was
intended for C++ relative vtables, but C++ relative vtable ended up
using DSOLocalEquivalent (lowerDSOLocalEquivalent).
This special treatment of `unnamed_addr` seems unusual.
Let's remove it. Only COFF needs an overload to generate a @IMGREL32
relocation specifier (llvm/test/MC/COFF/cross-section-relative.ll).
Pull Request: https://github.com/llvm/llvm-project/pull/132684
The existing pretty printer generates excessive parentheses for
MCBinaryExpr expressions. This update removes unnecessary parentheses
of MCBinaryExpr with +/- operators and MCUnaryExpr.
Since relocatable expressions only use + and -, this change improves
readability in most cases.
Examples:
- (SymA - SymB) + C now prints as SymA - SymB + C.
This updates the output of -fexperimental-relative-c++-abi-vtables for
AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8`
- expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this
change primarily affecting AMDGPUMCExpr.
Add tests to various targets covering rotate idioms where an 'ADD' node
is used to combine the halves instead of an 'OR'. Some of these cases
will be better optimized following #125612, while others are already
well optimized or do not have a valid fold to a rotate or funnel-shift.
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt-in to non-intrinsic
debug-info if it was built into LLVM.
Nowadays, non-intrinsic format is the default and has been on for more
than a year, there's no need for this flag to exist.
(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
This is meant as a preparation for PR #130988 "[AMDGPU] Implement IR
expansion for frem instruction" which implements the expansion of
another instruction in this pass. The more general name seems more
appropriate given this change and quite reasonable even without it.
FPSCR and FPEXC will be stored in FPStatusRegs, after GPRCS2 has been
saved.
- GPRCS1
- GPRCS2
- FPStatusRegs (new)
- DPRCS
- GPRCS3
- DPRCS2
FPSCR is present on all targets with a VFP, but the FPEXC register is
not present on Cortex-M devices, so different amounts of bytes are
being pushed onto the stack depending on our target, which would
affect alignment for subsequent saves.
DPRCS1 will sum up all previous bytes that were saved, and will emit
extra instructions to ensure that its alignment is correct. My
assumption is that if DPRCS1 is able to correct its alignment to be
correct, then all subsequent saves will also have correct alignment.
Avoid annotating the saving of FPSCR and FPEXC for functions marked
with the interrupt_save_fp attribute, even though this is done as part
of frame setup. Since these are status registers, there really is no
viable way of annotating this. Since these aren't GPRs or DPRs, they
can't be used with .save or .vsave directives. Instead, just record
that the intermediate registers r4 and r5 are saved to the stack
again.
Co-authored-by: Jake Vossen <jake@vossen.dev>
Co-authored-by: Alan Phipps <a-phipps@ti.com>
When we have a floating-point operation that a target doesn't support
for a given type, but does support for a wider type, then there are two
ways this can be handled:
* If the target doesn't have any registers at all of this type then
LegalizeTypes will convert the operation.
* If we do have registers but no operation for this type, then the
operation action will be Promote and it's handled in PromoteNode.
In both cases the operation at the wider type, and the conversion
operations to and from that type, should have the same fast math flags
as the original operation.
This is being done in preparation for a DAGCombine patch which makes use
of these fast math flags.
This fixes an expensive chesk failure after 8476a5d480304. The issue
was essentially that getRegClassConstraintEffectForVReg was not doing
anything useful, sometimes. If the register passed to it is not present
in the instruction, it is a no-op and returns the original classe. The
Edit->getReg() register may not be the register as it appears in either
the use or def instruction. It may be some split register, so take
the register directly from the instruction being rematerialized.
Also directly query the constraint from the def instruction, with a
hardcoded operand index. This isn't ideal, but all the other
rematerialize
code makes the same assumption.
So far I've been unable to reproduce this with a standalone MIR test. In
the
original case, stop-before=greedy and running the one pass is not
working.
Following 15e295d the machine scheduler no longer filters-out single-MI
regions when emitting regions to schedule. While this has no functional
impact at the moment, it generally has a negative compile-time impact
(see #128739).
Since all targets but AMDGPU do not care for this behavior, this
introduces an off-by-default flag to `ScheduleDAGInstrs` to control
whether such regions are going to be scheduled, effectively reverting
15e295d for all targets but AMDGPU (currently the only target enabling
this flag).
The MI scheduler skips regions containing a single MI during scheduling.
This can prevent targets that perform multi-stage scheduling and move
MIs between regions during some stages to reason correctly about the
entire IR, since some MIs will not be assigned to a region at the
beginning.
This makes the machine scheduler no longer skip single-MI regions. Only
a few unit tests are affected (mainly those which check for the
scheduler's debug output).
Even if an inline asm doesn't have memory effects, we can't assume it's
safe to speculate: it could trap, or cause undefined behavior. At the
LLVM IR level, this is handled correctly: we don't speculate inline asm
(unless it's marked "speculatable", but I don't think anyone does that).
Codegen also needs to respect this restriction.
This change stops Early If Conversion and similar passes from
speculating an INLINEASM MachineInstr.
Some uses of isSafeToMove probably could be switched to a different API:
isSafeToMove assumes you're hoisting, but we could handle some forms of
sinking more aggressively. But I'll leave that for a followup, if it
turns out to be relevant.
See also discussion on gcc bugtracker
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102150 .
The Target hook convertSelectOfConstantsToMath() needs to be used within
SimplifySelectCC helper combine function in SelectionDAG Isel, where
generic select folding with constants is happening into simple maths op
using the condition as it is.
It necessarily fixes#121145.
The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.
I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.
By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.
Noticed while fighting the x86 regressions in #122671
`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.
Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
---------
Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
Fix the case where the vector element type of the loaded extractelement
input does not match the result type of the extract.
This fixes a regression reported after
c55a7659b38946350315ac4a18d9805deb1f0a54
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10 year old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subrgs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other pasess like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.