This ensures we clip the index to be in bounds of the vector we are
inserting into. If the index is out of bounds the results of the insert
element is poison. If we don't clip the index we can write memory that
was not part of the original store.
Fixes#74248#75557.
This ensures we clip the index to be in bounds of the vector we are
inserting into. If the index is out of bounds the results of the insert
element is poison. If we don't clip the index we can write memory that
was not part of the original store.
Fixes#74248.
Sometimes, loads can appear in a loop after the LICM pass is executed
the final time. For example, ExpandMemCmp pass creates loads in a loop,
and one of the operands may be an invariant address.
This patch extends the pre-regalloc stage MachineLICM by allowing to
hoist invariant loads from loops that don't have any stores or calls
and allows load reorderings.
If we start with an i128 shift, the initial shift amount would usually
have zeros in bit 8 and above. xoring the shift amount with -1 will set
those upper bits to 1. If DAGCombiner is able to prove those bits are
now 1, then the shift that uses the xor will be replaced with undef.
Which we don't want.
Reduce the xor constant to VT.bits-1 where VT is half the size of the
larger shift type. This avoids toggling the upper bits. The hardware
shift instruction only uses the lower bits of the shift amount. I assume
the code used NOT because the hardware doesn't use the upper bits, but
that isn't compatible with the LLVM poison semantics.
Fixes#71142.
MipsIncomingValueHandler::assignCustomValue should return 1 instead of
2. The return value is the number of additional ArgLocs being consumed.
It's assumed that at least 1 is consumed.
Correct the LocVT used for the spill when there are no registers left.
It should be f64 instead of i32. This allows a workaround to be removed
in the SelectionDAG path.
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:
* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:
- We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
- The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
- This cleans up a lot of legalizer output and even improves compilation-times.
AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count).
Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D159140
The issue is uncovered by #47698: for IR files without a target triple,
-mtriple= specifies the full target triple while -march= merely sets the
architecture part of the default target triple, leaving a target triple which
may not make sense, e.g. riscv64-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without a target
triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead
of rejecting it outrightly.
Apparently the testcase
coalesce-partial-redundant-reguse-terminator.mir
was broken in a way that -verify-coalescing detected.
Update the testcase so -verify-coalescing doesn't complain and so
that it still exposes the problem originally fixed in 6c062b7641623.
Differential Revision: https://reviews.llvm.org/D158397
This modifies the G_UADDE legalizaton to a version that looks shorter
on Mips and RISC-V when feeding the equivalent IR to SelectionDAG.
This also removes the boolean select from G_USUBE.
Comments taken from LegalizeDAG and tweaked.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D158232
If carryin was 1, and RHS is 0xffffffff we were not giving a carry
out.
In that case Res would be equal to LHS, so Res <u LHS would be false.
But there should be a carry out since carryin+RHS wraps around to 0.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157943
D52985/D57677 added a .gcc_except_table workaround, but the new behavior
doesn't match GNU assembler.
```
void foo();
int bar() {
foo();
try { throw 1; }
catch (int) { return 1; }
return 0;
}
clang --target=mipsel-linux-gnu -mmicromips -S a.cc
mipsel-linux-gnu-gcc -mmicromips -c a.s -o gnu.o
.uleb128 ($cst_end0)-($cst_begin0) // bit 0 is not forced to 1
.uleb128 ($func_begin0)-($func_begin0) // bit 0 is not forced to 1
```
I have inspected `.gcc_except_table` output by `mipsel-linux-gnu-gcc -mmicromips -c a.cc`.
The `.uleb128` values are not forced to set the least significant bit.
In addition, D57677's adjustment (even->odd) to CodeGen/Mips/micromips-b-range.ll is wrong.
PC-relative `.long func - .` values will differ from GNU assembler as well.
The original intention of D52985 seems unclear to me. I think whatever
goal it wants to achieve should be moved to an upper layer.
This isMicroMips special case has caused problems to fix MCAssembler::relaxLEB to use evaluateAsAbsolute instead of evaluateKnownAbsolute,
which is needed to proper support R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128.
Differential Revision: https://reviews.llvm.org/D157655
Checks for implicit defs that are unused within a pattern and mark them as dead.
This is done directly at the TableGen level forr efficiency.
The instructions are directly created with the "dead" operand and no further analysis is needed later.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D157273
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.
The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
And dependent commits.
Details in D150388.
This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.
No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh
Differential Revision: https://reviews.llvm.org/D156381
The Mips MSA ABI requires that legal vector types are passed in
scalar registers in packed representation. E.g. a type like v16i8
would be passed as two i64 registers.
The implementation attempts to do the same for illegal vectors with
non-power-of-two element counts or non-power-of-two element types.
However, the SDAG argument lowering code doesn't support this, and
it is not easy to extend it to support this (we would have to deal
with situations like passing v7i18 as two i64 values).
This patch instead opts to restrict the special argument lowering
to only vectors with power-of-two elements and round element types.
Everything else is lowered naively, that is by passing each element
in promoted registers.
Fixes https://github.com/llvm/llvm-project/issues/63608.
Differential Revision: https://reviews.llvm.org/D154445
We currently don't extract vector elements from multi-use build vectors unless TLI.aggressivelyPreferBuildVectorSources accepts them, which seems a little extreme for constant build vectors (especially as under some cases ComputeKnownBits will indirectly extract the data for us).
This is causing a few regressions in some upcoming SimplifyDemandedBits work I'm looking at, all of which just need to know that the element is zero, so I've tweaked the fold to accept zero elements as well, which will typically fold very easily.
Differential Revision: https://reviews.llvm.org/D155582
Set setMaxAtomicSizeInBitsSupported for Mips. Set the value as appropriate for 64-bit MIPS vs 32-bit.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D141189
Replacing D143754. Right now the LiveRangeSplitting during register allocation uses
TargetOpcode::COPY instruction for splitting. For AMDGPU target that creates a
problem as we have both vector and scalar copies. Vector copies perform a copy over
a vector register but only on the lanes(threads) that are active. This is mostly sufficient
however we do run into cases when we have to copy the entire vector register and
not just active lane data. One major place where we need that is live range splitting.
Allowing targets to use their own copy instructions(if defined) will provide a lot of
flexibility and ease to lower these pseudo instructions to correct MIR.
- Introduce getTargetCopyOpcode() virtual function and use if to generate copy in Live range
splitting.
- Replace necessary MI.isCopy() checks with TII.isCopyInstr() in register allocator pipeline.
Reviewed By: arsenm, cdevadas, kparzysz
Differential Revision: https://reviews.llvm.org/D150388
If we have a store of a load with no other uses in between it, it's
considered dead and is removed. So sometimes when legalizing a fixed
length vector store of an insert, we end up producing better code
through scalarization than without.
An example is the follow below:
%a = load <4 x i64>, ptr %x
%b = insertelement <4 x i64> %a, i64 %y, i32 2
store <4 x i64> %b, ptr %x
If this is scalarized, then DAGCombine successfully removes 3 of the 4
stores which are considered dead, and on RISC-V we get:
sd a1, 16(a0)
However if we make the vector type legal (-mattr=+v), then we lose the
optimisation because we don't scalarize it.
This patch attempts to recover the optimisation for vectors by
identifying patterns where we store a load with a single insert
inbetween, replacing it with a scalar store of the inserted element.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D152276
AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.
Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
[DAGCombiner] handle more store value forwarding
When lowering calls on target like PPC, some stack loads
will be generated for by value parameters. Node CALLSEQ_START
prevents such loads from being combined.
Suggested by @RolandF, this patch removes the unnecessary
loads for the byval parameter by extending ForwardStoreValueToDirectLoad
Reviewed By: nemanjai, RolandF
Differential Revision: https://reviews.llvm.org/D138899
Default MaxDivRemBitWidthSupported to 128, so that divisions larger
than 128 bits are always expanded, without requiring additional
configuration from the target.
Note that this may still emit calls to __udivti3 on 32-bit targets,
which likely don't have an implementation of that builtin. However,
I believe this is sufficient to fix
https://github.com/llvm/llvm-project/issues/60531, because Zig must
already be defining those builtins.
Differential Revision: https://reviews.llvm.org/D144871
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.
The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.
Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).
Differential Revision: https://reviews.llvm.org/D135462
When we use llc or lld to compiler IR files, the features +nan2008 and +fpxx/+fp64 are not used.
Thus wrong format files are produced.
In IR files, the attributes are only set for function while not the whole compile units.
So we extract the attributes from the first function and use it for the whole unit.
isFPXXDefault: for o32, the FPXX should always be the default, no matter about the vendors.
Of course some distributions with FP64 default enabled should be listed explicit.
Let's add them in future if we know about one.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D140270
When lowering calls on target like PPC, some stack loads
will be generated for by value parameters. Node CALLSEQ_START
prevents such loads from being combined.
Suggested by @RolandF, this patch removes the unnecessary
loads for the byval parameter by extending ForwardStoreValueToDirectLoad
Reviewed By: nemanjai, RolandF
Differential Revision: https://reviews.llvm.org/D138899
https://reviews.llvm.org/D140493 is going to teach SROA how to promote allocas
that have variably-indexed loads. That does bring up questions of cost model,
since that requires creating wide shifts.
Indeed, our legalization for them is not optimal.
We either split it into parts, or lower it into a libcall.
But if the shift amount is by a multiple of CHAR_BIT,
we can also legalize it throught stack.
The basic idea is very simple:
1. Get a stack slot 2x the width of the shift type
2. store the value we are shifting into one half of the slot
3. pad the other half of the slot. for logical shifts, with zero, for arithmetic shift with signbit
4. index into the slot (starting from the base half into which we spilled, either upwards or downwards)
5. load
6. split loaded integer
This works for both little-endian and big-endian machines:
https://alive2.llvm.org/ce/z/YNVwd5
And better yet, if the original shift amount was not a multiple of CHAR_BIT,
we can just shift by that remainder afterwards: https://alive2.llvm.org/ce/z/pz5G-K
I think, if we are going perform shift->shift-by-parts expansion more than once,
we should instead go through stack, which is what this patch does.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D140638
I'm not sure why, but the absence of bitcasts / no-op GEPs causes
the branch delay slot to be used.
Differential Revision: https://reviews.llvm.org/D141593
This reverts commit 9739bb81aed490bfcbcbbac6970da8fb7232fd34.
It causes `.module is not permitted after generating code`
for Linux kernel's `ARCH=mips 32r1_defconfig` clang+GNU as build.
It's confirmed as a defect, but the proper fix needs time to sort out.