Previously reverted in 8b446ea2ba39e406bcf940ea35d6efb4bb9afe95
Reapplying because this commit is NOT DEPENDENT on the reverted commit
fc21f2d7bae2e0be630470cc7ca9323ed5859892, which broke the ASAN buildbot.
See https://reviews.llvm.org/rGfc21f2d7bae2e0be630470cc7ca9323ed5859892 for
more information.
The arguments to a PHI may represent a recurrence by eventually using the output
of the PHI itself. This is now handled by checking for cycles in the control
flow. If a PHI is not in a recurrence, it is now able to report multiple offsets
instead of conservatively reporting unknown.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D138991
This allows it to fold WHILEop with constant operand to PTRUE instruction in
the case given range is fitted to predicate format. Also, this change
fixes the unsigned overflow error introduced in D137547 for WHILELO lowering.
Differential Revision: https://reviews.llvm.org/D139068
sifive-7-series can predicate ALU instructions in the shadow of a
branch not just move instructions.
This patch implements analyzeSelect/optimizeSelect to predicate
these operations. This is based on ARM's implementation which can
predicate using flags and condition codes.
I've restricted it to just the instructions we have test cases for,
but it can be extended in the future.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D140053
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which isn an unproblematic case.
This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D124196
Unlike the callee-saved VGPR spill instructions emitted by
`PEI::spillCalleeSavedRegs`, the CS VGPR spills inserted during
emitPrologue/emitEpilogue require the exec bits flipping to avoid
clobbering the inactive lanes of VGPRs used for SGPR spilling.
Currently, these spill instructions are referenced from the SP at
function entry and when the callee performs a stack realignment,
they ended up getting incorrect stack offsets. Even if we try to
adjust the offsets, the FP-SP becomes a runtime entity with dynamic
stack realignment and the offsets would still be inaccurate.
To fix it, use FP as the frame base in the spill instructions
whenever the function has FP. The offsets obtained for the CS
objects would always be the right values from FP.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134949
In general, a callee is free to use a scratch register without
preserving its previous state. However, the VGPR used for SGPR
spilling can potentially have its inactive lanes overwritten by
the writelane instructions. When the function returns, it can
cause unexpected behavior if the VGPR value is not preserved
appropriately.
The current scheme to preserve the inactive lanes of such
scratch VGPRs is not done rightly. It preserves all lanes
and causes the outgoing values (if any) getting overwritten
by the epilog restores. It then corrupts the return value.
To avoid such situation with scratch VGPRs, this patch ensures
we preserve only their inactive lanes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134526
There is a lot of customization and eventually code duplication in the
frame lowering that handles special SGPR spills like the one needed for
the Frame Pointer. Incorporating any additional SGPR spill currently
makes it difficult during PEI. This patch introduces a new spill builder
to efficiently handle such spill requirements. Various spill methods are
special handled using a separate class.
Reviewed By: sebastian-ne, scott.linder
Differential Revision: https://reviews.llvm.org/D132436
SILowerSGPRSpills pass handles the lowering of SGPR spills
into VGPR lanes. Some SGPR spills are handled later during
PEI. There is a common function used in both places to find
the free VGPR lane. This patch eliminates that dependency to
find the free VGPR by handling it separately for PEI. It is a
prerequisite patch for a future work to allow SGPR spills to
virtual VGPR lanes during SILowerSGPRSpills.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D124195
We always assume the vector register is dead or killed while
inserting the VGPR spills in the prolog. It is not always
true. Used the entry block liveIn data while setting the flag.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D124194
Since the writelane instruction used for SGPR spills can
modify inactive lanes, the callee must preserve the VGPR
this instruction modifies even if it was marked Caller-saved.
Reviewed By: arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D124192
Most of VP intrinsics are implemented in RISC-V backends, but
vp.reduce.mul (element length > 1) does not yet. Legalizes vp.reduce.mul
using ExpandVectorPredication Pass.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D139721
This reverts commit 37b8f09a4b61bf9bf9d0b9017d790c8b82be2e17,
and returns commit 1bd0b82e508d049efdb07f4f8a342f35818df341.
The miscompile was in InstCombine, and it has been addressed.
This tries to approach the problem noted by @arsenm:
terrible codegen for `__builtin_fpclassify()`:
https://godbolt.org/z/388zqdE37
Just because the PHI in the common successor happens to have different
incoming values for these two blocks, doesn't mean we have to give up.
It's quite easy to deal with this, we just need to produce a select:
https://alive2.llvm.org/ce/z/000srb
Now, the cost model for this transform is rather overly strict,
so this will basically never fire. We tally all (over all preds)
the selects needed to the NumBonusInsts
Differential Revision: https://reviews.llvm.org/D139275
Presently, there is an issue on MI100 (and probably other architecture) where the VGPR used for AGPR copies clobbers VGPR used for AGPR spill. AFAICT this is because in processFunctionBeforeFrameIndicesReplaced we think the VGPR register for AGPR spill is unused. This patch aims to correct that. This is a WIP while I work out issues with producing a good test. For now, I'm curious if this is generally a good / bad idea.
Differential Revision: https://reviews.llvm.org/D139673
* This is a recommit of 3c4d2a03968ccf5889bacffe02d6fa2443b0260f,
* which was reverted in 25f01d593ce296078f57e872778b77d074ae5888,
because it exposed a miscompile in PPC backend, which was resolved
in https://reviews.llvm.org/D140089 / cb3f415cd2019df7d14683842198bc4b7a492bc5.
* which was a recommit of cf624b23bc5d5a6161706d1663def49380ff816a,
* which was reverted in 5cfc22cafe3f2465e0bb324f8daba82ffcabd0df,
because the cut-off on the number of vector elements was not low enough,
and it triggered both SDAG SDNode operand number assertions,
5and caused compile time explosions in some cases.
Let's try with something really *REALLY* conservative first,
just to get somewhere, and try to bump it later.
FIXME: should this respect TTI reg width * num vec regs?
Original commit message:
Now, there's a big caveat here - these bytes
are abstract bytes, not the i8 we have in LLVM,
so strictly speaking this is not exactly legal,
see e.g. https://github.com/AliveToolkit/alive2/issues/860
^ the "bytes" "could" have been a pointer,
and loading it as an integer inserts an implicit ptrtoint.
But at the same time,
InstCombine's `InstCombinerImpl::SimplifyAnyMemTransfer()`
would expand a memtransfer of 1/2/4/8 bytes
into integer-typed load+store,
so this isn't exactly a new problem.
Note that in memory, poison is byte-wise,
so we really can't widen elements,
but SROA seems to be inconsistent here.
Fixes#59116.
The combiner for BUILD_VECTOR that merges consecutive
loads into a wide load had two issues:
- It didn't check that the input loads all have the
same input chain
- It didn't update nodes that are chained to the original
loads to be chained to the new load
This caused issues with bootstrap when
3c4d2a03968ccf5889bacffe02d6fa2443b0260f was committed.
This patch fixes the issue so it can unblock this commit.
Differential revision: https://reviews.llvm.org/D140046
This change:
- Modifies the ACLE code to allow the new SLC value (3) for the prefetch
target.
- Introduces a new intrinsic, @llvm.aarch64.prefetch which matches the
PRFM family instructions much more closely, and can represent all
values for the PRFM immediate.
The target-independent @llvm.prefetch intrinsic does not have enough
information for us to be able to lower to it from the ACLE intrinsics
correctly.
- Lowers the acle calls to the new intrinsic on aarch64 (the ARM
lowering is unchanged).
- Implements code generation for the new intrinsic in both SelectionDAG
and GlobalISel. We specifically choose to continue to support lowering
the target-independent @llvm.prefetch intrinsic so that other
frontends can continue to use it.
Differential Revision: https://reviews.llvm.org/D139443
Other targets like ARM, AArch64, RISCV and X86 have similar tests.
`O1`, `O2` and `O3` appear to be the same for now. But in future, some
passes may be disabled at lower levels (e.g. `O1`). Hoping we can use
FileCheck prefixes for differences to avoid repeating the contents 3
times.
Reviewed By: xen0n, MaskRay
Differential Revision: https://reviews.llvm.org/D139499
We should probably handle any 32-bit type here, but the intrinsic
definition and selection pattern currently do not. Avoids a few lit
tests failures when switched on by default.
Sometimes we have a constant value loaded to VGPR. In case we further
need to rematrerialize it in the physical scalar register we may avoid VGPR to
SGPR copy replacing it with S_MOV_B32.
Reviewed By: JonChesterfield, arsenm
Differential Revision: https://reviews.llvm.org/D139874
We are in the process of retiring LLVM_HAVE_TF_API in favor of
LLVM_HAVE_TFLITE. This patch takes care of the transition in
llvm/test.
Differential Revision: https://reviews.llvm.org/D140133
Adds support for i64 constant. It uses the same pattern-based
approach as in SDAG (see PPCISelDAGToDAG::selectI64ImmDirect(),
PPCISelDAGToDAG::selectI64Imm()). It does not support the
prefixed instructions.
Reviewed By: arsenm, tschuett
Differential Revision: https://reviews.llvm.org/D140119
SelectionDAG aggressively creates sext_inreg operations after
promoting an i32 add. If the add is later matched to a sh1add,
sh2add or sh3add, a sext.w from the sext_inreg will get left behind.
In many cases we can prove this sext.w is unnecessary by checking
if its upper bits are ever used.
Changing cast_lds_gv into a kernel function to
lower the LDS usage appropriately. The LDS lowering
is currently won't happen for orphan device functions.
See if we can freely sign-extend both sources of a vselect operand, also handle allones constant build vectors (easily rematerializable and uses in the test case).
Fixes#59526
This extends the backwards walk to allow mutating the previous vsetvl's AVL value if it was not used by any instructions in between. In practice, this mostly benefits vmv.x.s and fvmv.f.s patterns since vector instructions which ignore VL are rare.
Differential Revision: https://reviews.llvm.org/D140048
[AArch64] Patch for lowering trunc instructions to 'tbl' for (8|16)xi32 -> (8|16)xi8 conversions in https://reviews.llvm.org/D133495 is extended to support trunc to tbl lowering for (8|16) x i64 to (8|16) x i8.
A microbenchmark for runtime for these transformations is added in https://reviews.llvm.org/D136274
Reviewed by: fhahn, t.p.northover
Differential Revision: https://reviews.llvm.org/D135229
These tests show code generation for vectorized trunc lowering from i16 to i8 in AArch64.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D137293
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.
The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.
Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).
Differential Revision: https://reviews.llvm.org/D135462
Reland with a fixup to avoid converting APInts to int64_t which allowed for
overflows (UB) with sufficiently high/low multiplier values.
This allows DemandedBits to see the result of VSCALE will be at most
VScaleMax * some compile-time constant. This relies on the vscale_range()
attribute being present on the function, with a max set. (This is done by
default when clang is targeting AArch64+SVE).
Using this various redundant operations (zexts, sexts, ands, ors, etc)
can be eliminated.
Differential Revision: https://reviews.llvm.org/D138508
This continues the refactoring work of selecting offset + address
operands with the AddrOpsN pattern, previously called LoadOpsN.
This is not an NFC, since constant addresses are now folded into the
offset in more places for v128.storeN_lane.
Differential Revision: https://reviews.llvm.org/D139950
If the divisor is repeated at least twice, we will convert the FDIVs to the
calculation of the reciprocal and FMULs.
We perform the transformation only under fast-math mode. FDIVs must have
'arcp' flag.
Differential Revision: https://reviews.llvm.org/D140024
When we use llc or lld to compiler IR files, the features +nan2008 and +fpxx/+fp64 are not used.
Thus wrong format files are produced.
In IR files, the attributes are only set for function while not the whole compile units.
So we output `.nan 2008` and `.module fp=xx/64` before every function.
`isFPXXDefault`: for o32, the FPXX should always be the default, no matter about the vendors.
Of course some distributions with FP64 default enabled should be listed explicit.
Let's add them in future if we know about one.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D138179
Summary: Currently we get a wrong fixed value for R_RBR relocations when -ffunction-sections enabled. This patch fixes this.
Reviewed By: DiggerLin, shchenz
Differential Revision: https://reviews.llvm.org/D138982