Here we introduce three new GMIR instructions to cover a set of trap
intrinsics. The idea behind it is that generic intrinsics shouldn't be
used with G_INTRINSIC opcode.
These new instructions can match perfectly with existing trap ISD nodes.
It allows X86, AArch64, RISCV and Mips to reuse SelectionDAG patterns for
selection and avoid manual selection. However AMDGPU is an exception. It
selects traps during legalization regardless SelectionDAG or GlobalISel.
Since there are not many places where traps are used, this change
attempts to clean up all the usages of G_INTRINSIC with trap intrinsics. So,
there is no stage when both G_TRAP and
G_INTRINSIC_W_SIDE_EFFECTS(@llvm.trap) are allowed.
MIPSr6 ISA requires normal load/store instructions support
misunaligned memory access, while it is not always do so
by hardware. On some microarchitectures or some corner cases
it may need support by OS.
Don't confuse with pre-R6's lwl/lwr famlily: MIPSr6 doesn't
support them, instead, r6 requires lw instruction support
misunaligned memory access. So, if -mstrict-align is used for
pre-R6, lwl/lwr won't be disabled.
If -mstrict-align is used for r6 and the access is not well
aligned, some lb/lh instructions will be used to replace lw.
This is useful for OS kernels.
To be back-compatible with GCC, -m(no-)unaligned-access are also
added as Neg-Alias of -m(no-)strict-align.
- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.
- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
This include 2 fixes:
1. Disallow 'f' for softfloat.
2. Allow 'r' for softfloat.
Currently, 'f' is accpeted by clang, then LLVM meets an internal error.
'r' is rejected by LLVM by: couldn't allocate input reg for constraint
'r'.
Fixes: #64241, #63632
---------
Co-authored-by: Fangrui Song <i@maskray.me>
…n MIPS
Modify:
Add a global variable 'CurForbiddenSlotAttr' to save current
instruction's forbidden slot and whether set reorder. This is the
judgment condition for whether to add nop. We would add a couple of
'.set noreorder' and '.set reorder' to wrap the current instruction and
the next instruction.
Then we can get previous instruction`s forbidden slot attribute and
whether set reorder by 'CurForbiddenSlotAttr'.
If previous instruction has forbidden slot and .set reorder is active
and current instruction is CTI. Then emit a NOP after it.
Fix https://github.com/llvm/llvm-project/issues/61045.
Because https://reviews.llvm.org/D158589 was 'Needs Review' state, not
ending, so we commit pull request again.
When parsing the `la` macro, we add a duplicate `$` prefix in
`getOrCreateSymbol`,
leading to `error: Undefined temporary symbol $$yy` for code like:
```
xx:
la $2,$yy
$yy:
nop
```
Remove the duplicate prefix.
In addition, recognize `.L`-prefixed symbols as local for O32.
See: #65020.
---------
Co-authored-by: Fangrui Song <i@maskray.me>
- Enable equivalent between `brcond` and `G_BRCOND`.
- Remove the manual selection of `G_BRCOND` in Mips. Revise test cases.
Reviewers: petar-avramovic, bcardosolopes, arsenm
Reviewed By: arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/81306
FastISel may create a redundant BGTZ terminal which fallthroughes.
```
BGTZ %2:gpr32, %bb.1, implicit-def $at
bb.1.bb1:
; predecessors: %bb.0
```
The `!I->isBarrier()` check in
MipsAsmPrinter::isBlockOnlyReachableByFallthrough
will incorrectly not print a label, leading to a `Undefined temporary
symbol `
error when we try assembling the output assembly file. See the updated
`Fast-ISel/pr40325.ll` and
https://github.com/rust-lang/rust/issues/108835
In addition, the `SwitchInst` condition is too conservative and prints
many unneeded labels (see the updated tests).
Just use the generic isBlockOnlyReachableByFallthrough, updated by
commit 1995b9fead62f2f6c0ad217bd00ce3184f741fdb for SPARC, which also
handles MIPS.
* Placing 'G' before 'M' (SHF_MERGE) can be misleading as the sh_entsize
argument goes before the section group name, if a reader doesn't know
that the order of extra arguments is not affected by the order of flags.
* 'a', 'w', and 'x' indicate basic permission-related flags. Separating
them with 'G' is kinda ugly.
Simplify code and move 'G' after 'o'. The new output is more similar to
GCC.
When both SHF_LINK_ORDER | SHF_GROUP flags are set, GNU assembler from
2.35 onwards (https://sourceware.org/PR25381https://sourceware.org/binutils/docs/as/Section.html) parses the
SHF_LINK_ORDER argument before section group name, different from us.
This is unfortunate, but does not matter because the `.section` flag `o`
is a niche feature only used by compiler instrumentations, not adopted
by hand-written assembly, and using both flags is extremely rare. Let's
just match GNU assembler. There is another benefit: we now support
zero-flag section group with the SHF_LINK_ORDER flag, while previously
there isn't a syntax.
While here, print 'G' after 'o' to be clear that the 'G' argument is
parsed after the 'o' argument. To make the diff smaller, we don't print
'G' after 'w' in the absence of 'o' for now.
Do optimization to turn x >> (shift & 31/63) into a single srlv instead
of andi + srlv, since the mips variable shift instruction already
implicitly masks the shift, like x86, wasm and AMDGPU. Copy the
X86DAGToDAGISel::isUnneededShiftMask() function to MIPS for checking
whether need combine two instructions to one.
This ensures we clip the index to be in bounds of the vector we are
inserting into. If the index is out of bounds the results of the insert
element is poison. If we don't clip the index we can write memory that
was not part of the original store.
Fixes#74248#75557.
This ensures we clip the index to be in bounds of the vector we are
inserting into. If the index is out of bounds the results of the insert
element is poison. If we don't clip the index we can write memory that
was not part of the original store.
Fixes#74248.
Sometimes, loads can appear in a loop after the LICM pass is executed
the final time. For example, ExpandMemCmp pass creates loads in a loop,
and one of the operands may be an invariant address.
This patch extends the pre-regalloc stage MachineLICM by allowing to
hoist invariant loads from loops that don't have any stores or calls
and allows load reorderings.
If we start with an i128 shift, the initial shift amount would usually
have zeros in bit 8 and above. xoring the shift amount with -1 will set
those upper bits to 1. If DAGCombiner is able to prove those bits are
now 1, then the shift that uses the xor will be replaced with undef.
Which we don't want.
Reduce the xor constant to VT.bits-1 where VT is half the size of the
larger shift type. This avoids toggling the upper bits. The hardware
shift instruction only uses the lower bits of the shift amount. I assume
the code used NOT because the hardware doesn't use the upper bits, but
that isn't compatible with the LLVM poison semantics.
Fixes#71142.
MipsIncomingValueHandler::assignCustomValue should return 1 instead of
2. The return value is the number of additional ArgLocs being consumed.
It's assumed that at least 1 is consumed.
Correct the LocVT used for the spill when there are no registers left.
It should be f64 instead of i32. This allows a workaround to be removed
in the SelectionDAG path.
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:
* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:
- We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
- The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
- This cleans up a lot of legalizer output and even improves compilation-times.
AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count).
Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D159140
The issue is uncovered by #47698: for IR files without a target triple,
-mtriple= specifies the full target triple while -march= merely sets the
architecture part of the default target triple, leaving a target triple which
may not make sense, e.g. riscv64-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without a target
triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead
of rejecting it outrightly.
Apparently the testcase
coalesce-partial-redundant-reguse-terminator.mir
was broken in a way that -verify-coalescing detected.
Update the testcase so -verify-coalescing doesn't complain and so
that it still exposes the problem originally fixed in 6c062b7641623.
Differential Revision: https://reviews.llvm.org/D158397
This modifies the G_UADDE legalizaton to a version that looks shorter
on Mips and RISC-V when feeding the equivalent IR to SelectionDAG.
This also removes the boolean select from G_USUBE.
Comments taken from LegalizeDAG and tweaked.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D158232
If carryin was 1, and RHS is 0xffffffff we were not giving a carry
out.
In that case Res would be equal to LHS, so Res <u LHS would be false.
But there should be a carry out since carryin+RHS wraps around to 0.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157943
D52985/D57677 added a .gcc_except_table workaround, but the new behavior
doesn't match GNU assembler.
```
void foo();
int bar() {
foo();
try { throw 1; }
catch (int) { return 1; }
return 0;
}
clang --target=mipsel-linux-gnu -mmicromips -S a.cc
mipsel-linux-gnu-gcc -mmicromips -c a.s -o gnu.o
.uleb128 ($cst_end0)-($cst_begin0) // bit 0 is not forced to 1
.uleb128 ($func_begin0)-($func_begin0) // bit 0 is not forced to 1
```
I have inspected `.gcc_except_table` output by `mipsel-linux-gnu-gcc -mmicromips -c a.cc`.
The `.uleb128` values are not forced to set the least significant bit.
In addition, D57677's adjustment (even->odd) to CodeGen/Mips/micromips-b-range.ll is wrong.
PC-relative `.long func - .` values will differ from GNU assembler as well.
The original intention of D52985 seems unclear to me. I think whatever
goal it wants to achieve should be moved to an upper layer.
This isMicroMips special case has caused problems to fix MCAssembler::relaxLEB to use evaluateAsAbsolute instead of evaluateKnownAbsolute,
which is needed to proper support R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128.
Differential Revision: https://reviews.llvm.org/D157655
Checks for implicit defs that are unused within a pattern and mark them as dead.
This is done directly at the TableGen level forr efficiency.
The instructions are directly created with the "dead" operand and no further analysis is needed later.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157273
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.
The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
And dependent commits.
Details in D150388.
This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.
No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh
Differential Revision: https://reviews.llvm.org/D156381
The Mips MSA ABI requires that legal vector types are passed in
scalar registers in packed representation. E.g. a type like v16i8
would be passed as two i64 registers.
The implementation attempts to do the same for illegal vectors with
non-power-of-two element counts or non-power-of-two element types.
However, the SDAG argument lowering code doesn't support this, and
it is not easy to extend it to support this (we would have to deal
with situations like passing v7i18 as two i64 values).
This patch instead opts to restrict the special argument lowering
to only vectors with power-of-two elements and round element types.
Everything else is lowered naively, that is by passing each element
in promoted registers.
Fixes https://github.com/llvm/llvm-project/issues/63608.
Differential Revision: https://reviews.llvm.org/D154445
We currently don't extract vector elements from multi-use build vectors unless TLI.aggressivelyPreferBuildVectorSources accepts them, which seems a little extreme for constant build vectors (especially as under some cases ComputeKnownBits will indirectly extract the data for us).
This is causing a few regressions in some upcoming SimplifyDemandedBits work I'm looking at, all of which just need to know that the element is zero, so I've tweaked the fold to accept zero elements as well, which will typically fold very easily.
Differential Revision: https://reviews.llvm.org/D155582
Set setMaxAtomicSizeInBitsSupported for Mips. Set the value as appropriate for 64-bit MIPS vs 32-bit.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D141189
Replacing D143754. Right now the LiveRangeSplitting during register allocation uses
TargetOpcode::COPY instruction for splitting. For AMDGPU target that creates a
problem as we have both vector and scalar copies. Vector copies perform a copy over
a vector register but only on the lanes(threads) that are active. This is mostly sufficient
however we do run into cases when we have to copy the entire vector register and
not just active lane data. One major place where we need that is live range splitting.
Allowing targets to use their own copy instructions(if defined) will provide a lot of
flexibility and ease to lower these pseudo instructions to correct MIR.
- Introduce getTargetCopyOpcode() virtual function and use if to generate copy in Live range
splitting.
- Replace necessary MI.isCopy() checks with TII.isCopyInstr() in register allocator pipeline.
Reviewed By: arsenm, cdevadas, kparzysz
Differential Revision: https://reviews.llvm.org/D150388
If we have a store of a load with no other uses in between it, it's
considered dead and is removed. So sometimes when legalizing a fixed
length vector store of an insert, we end up producing better code
through scalarization than without.
An example is the follow below:
%a = load <4 x i64>, ptr %x
%b = insertelement <4 x i64> %a, i64 %y, i32 2
store <4 x i64> %b, ptr %x
If this is scalarized, then DAGCombine successfully removes 3 of the 4
stores which are considered dead, and on RISC-V we get:
sd a1, 16(a0)
However if we make the vector type legal (-mattr=+v), then we lose the
optimisation because we don't scalarize it.
This patch attempts to recover the optimisation for vectors by
identifying patterns where we store a load with a single insert
inbetween, replacing it with a scalar store of the inserted element.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D152276