Imagine a loop of the form:
```
preheader:
%r = def
header:
bcc latch, inner
inner1:
..
inner2:
b latch
latch:
%r = subs %r
bcc header
```
It can be possible for code to spend a decent amount of time in the
header<->latch loop, not going into the inner part of the loop as much.
The greedy register allocator can prefer to spill _around_ %r though,
adding spills around the subs in the loop, which can be very detrimental
for performance. (The case I am looking at is actually a very deeply
nested set of loops that repeat the header<->latch pattern at multiple
different levels).
The greedy RA will apply a preference to spill to the IV, as it is live
through the header block. This patch attempts to add a heuristic to
prevent that in this case for variables that look like IVs, in a similar
regard to the extra spill weight that gets added to variables that look
like IVs, that are expensive to spill. That will mean spills are more
likely to be pushed into the inner blocks, where they are less likely to
be executed and not as expensive as spills around the IV.
This gives a 8% speedup in the exchange benchmark from spec2017 when
compiled with flang-new, whilst importantly stabilising the scores to be
less chaotic to other changes. Running ctmark showed no difference in
the compile time. I've tried to run a range of benchmarking for
performance, most of which were relatively flat not showing many large
differences. One matrix multiply case improved 21.3% due to removing a
cascading chains of spills, and some other knock-on effects happen which
usually cause small differences in the scores.
Optimization reduces the range for switches whose cases are positive powers
of two by replacing each case with count_trailing_zero(case).
Resolves#70756
Remove s64 as a valid type for G_JUMP_TABLE since I think it is always a
pointer?
Replace custom predicate for G_BRJT with a legalFor that checks 2 types.
If we are using entirely vslide1downs to initialize an otherwise undef
vector, we end up with an implicit_def as the source of the first
vslide1down. This register has to be allocated, and creates false
dependencies with surrounding code.
Instead, start our sequence with a vmv.v.x in the hopes of creating a
dependency breaking idiom. Unfortunately, it's not clear this will
actually work as due to the VL=0 special case for T.A. the hardware has
to work pretty hard to recognize that the vmv.v.x actually has no source
dependence. I don't think we can reasonable expect all hardware to have
optimized this case, but I also don't see any downside in prefering it.
In openai/triton#2483 I've encountered a bug in the NVPTX codegen. Given
`load<8 x half>` followed by `fpext to <8 x float>` we get
```
ld.shared.v4.b16 {%f1, %f2, %f3, %f4}, [%r15+8];
ld.shared.v4.b16 {%f5, %f6, %f7, %f8}, [%r15];
```
Which loads float16 values into float registers without any conversion
and the result is simply garbage.
This PR brings `v8f16` and `v8bf16` into line with the other vector
types by expanding it to load + cvt.
cc @manman-ren @Artem-B @jlebar
If we already have a YMM/ZMM constant that a smaller XMM/YMM has matching lower bits, then ensure we reuse the same constant pool entry.
Extends the similar combines we already have to reuse VBROADCAST_LOAD/SUBV_BROADCAST_LOAD constant loads.
This is a mainly a canonicalization, but should make it easier for us to merge constant loads in a future commit (related to both #70947 and better X86FixupVectorConstantsPass usage for #71078).
Adds GraalVM calling conventions. The only difference with the default calling conventions is that GraalVM reserves two registers for the heap base and the thread. Since the registers are then accessed by name, getRegisterByName has to be updated accordingly.
This patch implements the calling conventions only for X86, AArch64 and RISC-V.
For X86, the reserved registers are X14 and X15. For AArch64, they are X27 and X28. For RISC-V, they are X23 and X27.
This patch has been used by the LLVM backend of GraalVM's Native Image project in production for around 4 months with no major issues.
Differential Revision: https://reviews.llvm.org/D151107
We check the opcode of the first instruction in the block where the
prologue is inserted without checking if the iterator points to any
instructions. When the basic block is empty, that causes a crash. One
way the prologue block can be empty is when it starts with a call to
__builtin_readcyclecounter on RV32 since that produces a loop.
Co-authored-by: Nemanja Ivanovic <nemanja@synopsys.com>
Allocate a register of the correct register class for inline asm
constraint "r" when used for FP values with -Zfinx/-Zdinx.
---------
Co-authored-by: Nemanja Ivanovic <nemanja@synopsys.com>
We already have the pseudo's for lowering these as MI nodes with
rounding mode operands, and the generic FRM insertion pass. Doing the
insertion later in the backend allows SSA level passes to avoid
reasoning about physical register copies, and happens to produce better
code in practice. The later is mostly an accident of our insertion
order; we happen to place the frm write after the vsetvli, and it's very
common for a register to be killed at the vsetvli. End result is that we
get slightly better scalar register allocation.
I'm a bit unclear on the history here. I was surprised to find this code
in ISEL lowering at all, but am also surprised once I found it that all
the patterns and pseudos seem to already exist. My best guess is that
maybe we didn't do all the possible cleanup after introducing the
HasRoundMode mechanism?
We were printing the entire Constant, which if we were loading from a wider constant pool entry meant that we were confusing the asm comment with upper bits that aren't actually part of the load result
[X86]: Enable custom lowering for fabs.v8f16 on AVX
Currently, custom lowering of fabs.v8f16 requires AVX512FP16, which is
too restrictive. For v8f16 fabs lowering, no instructions in AVX512FP16
are needed. Without the fix, horribly inefficient code is generated
without AVX512FP16. Note instcombiner generates calls to intrinsics
@llvm.fabs.v8f16 when simplifyping AND <8 x half> operations.
For the following testcase:
undef %1.sub1:sgpr_96 = COPY undef %0:sgpr_32
%3:vgpr_32 = V_LSHL_ADD_U32_e64 %1.sub1:sgpr_96, ...
GCNRewritePartialRegUses produced:
%4:vgpr_32 = COPY undef %1:sgpr_32
dead %2:vgpr_32 = V_LSHL_ADD_U32_e64 %4, ...
Register class for %4 is incorrect: there should be sgpr_32 instead of
vgpr_32 because the original %1 had scalar regclass. This patch fixes
that.
Note that GCNRewritePartialRegUses pass isn't enabled by default yet.
As explained in the comment in SIInstrInfo::FoldImmediate, if we have a
choice between v_madak_f32 and v_madmk_f32 we should choose the former
so that the literal that is not folded into the instruction can be
materialized in an sgpr instead of a vgpr.
Sometimes, loads can appear in a loop after the LICM pass is executed
the final time. For example, ExpandMemCmp pass creates loads in a loop,
and one of the operands may be an invariant address.
This patch extends the pre-regalloc stage MachineLICM by allowing to
hoist invariant loads from loops that don't have any stores or calls
and allows load reorderings.
We adjust the insertion point at the BB top for spills/copies during RA
to ensure they are placed after the exec restore instructions required
for the divergent control flow execution. This is, however, required
only for the vector operations. The insertions for scalar registers can
still go to the BB top.
All known linkers handle R_RISCV_CALL and R_RISCV_CALL_PLT in the same
way (GNU ld since
https://sourceware.org/pipermail/binutils/2020-August/112750.html).
MO_CALL is for R_RISCV_CALL, a deprecated relocation type. We don't
migrate away from MO_CALL yet.
For GISel we don't have the output difference concern and should weigh
more on simplicity.
Vector Register banks are created for the various register vector
register groupings. getRegBankFromRegClass is implemented to go from
vector TargetRegisterClass to the corresponding vector RegisterBank.
Use vmv_s_x instead of the constant will be materialized in a GPR.
This avoids going from GPR to FPR to vector.
We already did this for RISCVISD::VFMV_S_F_VL and probably we should
just turn int_riscv_vfmv_s_f into RISCVISD::VFMV_S_F_VL, but I'd like
to see some improvements to RISCVInsertVSETVLI first.
When merging lane masks, value from block that is always visited first
(PrevReg in buildMergeLaneMasks) needs to exist because we do on-the-fly
constant folding. For PrevReg to exist, basic block that should contain
PrevReg definition must be processed first. Sort the incomings such that
incoming values that dominate other incoming values are processed first.
Sorting of phi incomings makes no changes for phis created by SDAG
because SDAG adds phi incomings as it selects basic blocks in reversed
post order traversal.
This change is required by upcoming lane mask merging implementation
for GlobalISel that leaves phi incomings as they are in IR.
When x is not known to be nonzero, ctpop(x) == 1 is expanded to
x != 0 && (x & (x - 1)) == 0
resulting in codegen like
leal -1(%rdi), %eax
testl %eax, %edi
sete %cl
testl %edi, %edi
setne %al
andb %cl, %al
But another expression that works is
(x ^ (x - 1)) > x - 1
which has nicer codegen:
leal -1(%rdi), %eax
xorl %eax, %edi
cmpl %eax, %edi
seta %al