52796 Commits

Author SHA1 Message Date
Craig Topper
20f544d047
[RISCV][GISel] Instruction selection for G_JUMP_TABLE and G_BRJT. (#71987) 2023-11-18 12:33:25 -08:00
Craig Topper
8ad4df8327 [RISCV][GISel] Add s32 G_SELECT instruction select test for RV64. NFC 2023-11-18 12:29:04 -08:00
David Green
396e650ef3 [AArch64] Add some testing for BE shuffles. NFC 2023-11-18 20:09:58 +00:00
Craig Topper
0154e53bf3 [RISCV][GISel] Remove the rv32/rv64 subdirectories for legalizer tests. NFC
Add -rv32 -rv64 as suffix to test name. First step towards trying
to merge the content of these tests.
2023-11-18 11:25:09 -08:00
David Green
303a7835ff
[GreedyRA] Improve RA for nested loop induction variables (#72093)
Imagine a loop of the form:
```
  preheader:
    %r = def
  header:
    bcc latch, inner
  inner1:
    ..
  inner2:
    b latch
  latch:
    %r = subs %r
    bcc header
```

It can be possible for code to spend a decent amount of time in the
header<->latch loop, not going into the inner part of the loop as much.
The greedy register allocator can prefer to spill _around_ %r though,
adding spills around the subs in the loop, which can be very detrimental
for performance. (The case I am looking at is actually a very deeply
nested set of loops that repeat the header<->latch pattern at multiple
different levels).

The greedy RA will apply a preference to spill to the IV, as it is live
through the header block. This patch attempts to add a heuristic to
prevent that in this case for variables that look like IVs, in a similar
regard to the extra spill weight that gets added to variables that look
like IVs, that are expensive to spill. That will mean spills are more
likely to be pushed into the inner blocks, where they are less likely to
be executed and not as expensive as spills around the IV.

This gives an 8% speedup in the exchange benchmark from spec2017 when
compiled with flang-new, whilst importantly stabilising the scores to be
less chaotic to other changes. Running ctmark showed no difference in
the compile time. I've tried to run a range of benchmarking for
performance, most of which were relatively flat not showing many large
differences. One matrix multiply case improved 21.3% due to removing a
cascading chain of spills, and some other knock-on effects happen which
usually cause small differences in the scores.
2023-11-18 09:55:19 +00:00
Daniil
424c4249cc
[SimplifyCFG] Add optimization for switches of powers of two (#70977)
Optimization reduces the range for switches whose cases are positive powers
of two by replacing each case with count_trailing_zero(case).

Resolves #70756
2023-11-18 15:14:14 +08:00
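The switch rewrite described in 424c4249cc can be sketched roughly as follows. This is an illustration, not the pass itself: `countTrailingZeros`, `dispatchSparse` and `dispatchDense` are hypothetical names, and the sketch assumes the switched value is always one of the listed power-of-two case values (an assumption of this example, not something the commit message states).

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a count-trailing-zeros intrinsic; requires a nonzero input.
unsigned countTrailingZeros(std::uint32_t value) {
  assert(value != 0 && "cttz of zero is not meaningful in this sketch");
  unsigned count = 0;
  while ((value & 1u) == 0) {
    value >>= 1;
    ++count;
  }
  return count;
}

// Before: the case values 1, 2, 4, 8 span a sparse range.
int dispatchSparse(std::uint32_t flag) {
  switch (flag) {
  case 1: return 10;
  case 2: return 20;
  case 4: return 30;
  case 8: return 40;
  default: return -1;
  }
}

// After: switch on cttz(flag); each case value is replaced by its own
// trailing-zero count, giving the dense range 0..3.
int dispatchDense(std::uint32_t flag) {
  switch (countTrailingZeros(flag)) {
  case 0: return 10;
  case 1: return 20;
  case 2: return 30;
  case 3: return 40;
  default: return -1;
  }
}
```

The two functions agree for the inputs 1, 2, 4 and 8. They can differ for values that are not powers of two (for example 3), which is why the sketch restricts itself to the case values.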
Craig Topper
35ad44ebe4 [RISCV][GISel] Allow G_SELECT to have s32 type on RV64. 2023-11-17 17:12:27 -08:00
Craig Topper
d5ab48e583
[AArch64] Simplify legalizer info for G_JUMP_TABLE and G_BRJT. (#71962)
Remove s64 as a valid type for G_JUMP_TABLE since I think it is always a
pointer?

Replace custom predicate for G_BRJT with a legalFor that checks 2 types.
2023-11-17 16:44:24 -08:00
Arthur Eubanks
635756e4f3
[X86] Place data in large sections for large code model (#70265)
This allows better interoperability mixing small/medium/large code model
code since large code model data can be put into separate large
sections.

And respect large data threshold under large code model.
gcc also does this: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.

See https://groups.google.com/g/x86-64-abi/c/jnQdJeabxiU.
2023-11-17 15:47:28 -08:00
Philip Reames
144b2f579e
[RISCV] Start vslide1down sequence with a dependency breaking splat (#72691)
If we are using entirely vslide1downs to initialize an otherwise undef
vector, we end up with an implicit_def as the source of the first
vslide1down. This register has to be allocated, and creates false
dependencies with surrounding code.

Instead, start our sequence with a vmv.v.x in the hopes of creating a
dependency breaking idiom. Unfortunately, it's not clear this will
actually work: due to the VL=0 special case for T.A., the hardware has
to work pretty hard to recognize that the vmv.v.x actually has no source
dependence. I don't think we can reasonably expect all hardware to have
optimized this case, but I also don't see any downside in preferring it.
2023-11-17 12:02:58 -08:00
peterbell10
4263b2ecf8
[NVPTX] Expand EXTLOAD for v8f16 and v8bf16 (#72672)
In openai/triton#2483 I've encountered a bug in the NVPTX codegen. Given
`load<8 x half>` followed by `fpext to <8 x float>` we get

```
ld.shared.v4.b16 	{%f1, %f2, %f3, %f4}, [%r15+8];
ld.shared.v4.b16 	{%f5, %f6, %f7, %f8}, [%r15];
```

Which loads float16 values into float registers without any conversion
and the result is simply garbage.

This PR brings `v8f16` and `v8bf16` into line with the other vector
types by expanding it to load + cvt.

cc @manman-ren @Artem-B @jlebar
2023-11-17 09:51:50 -08:00
Simon Pilgrim
bfbfd1caa4 [X86] combineLoad - try to reuse existing constant pool entries for smaller vector constant data
If we already have a YMM/ZMM constant whose lower bits match a smaller XMM/YMM constant, then ensure we reuse the same constant pool entry.

Extends the similar combines we already have to reuse VBROADCAST_LOAD/SUBV_BROADCAST_LOAD constant loads.

This is mainly a canonicalization, but should make it easier for us to merge constant loads in a future commit (related to both #70947 and better X86FixupVectorConstantsPass usage for #71078).
2023-11-17 17:48:37 +00:00
Stanislav Mekhanoshin
7057f8f676
[AMDGPU] Pre-commit fdot2 test. NFC. (#72622)
This test exposes a bug where we violate constant bus restriction.
2023-11-17 09:32:38 -08:00
Antonio Frighetto
88d0ceb689 [AArch64] Additional test coverage for PR67879 (NFC)
Introduce further test exercising `isAArch64FrameOffsetLegal`.
2023-11-17 17:32:50 +01:00
Sacha Coppey
aeedc07637 [IR] Add GraalVM calling conventions
Adds GraalVM calling conventions. The only difference with the default calling conventions is that GraalVM reserves two registers for the heap base and the thread. Since the registers are then accessed by name, getRegisterByName has to be updated accordingly.

This patch implements the calling conventions only for X86, AArch64 and RISC-V.

For X86, the reserved registers are R14 and R15. For AArch64, they are X27 and X28. For RISC-V, they are X23 and X27.

This patch has been used by the LLVM backend of GraalVM's Native Image project in production for around 4 months with no major issues.

Differential Revision: https://reviews.llvm.org/D151107
2023-11-17 16:30:09 +00:00
Nemanja Ivanovic
227607190e
[RISCV] Fix crash in PEI with empty entry block with Zcmp (#72117)
We check the opcode of the first instruction in the block where the
prologue is inserted without checking if the iterator points to any
instructions. When the basic block is empty, that causes a crash. One
way the prologue block can be empty is when it starts with a call to
__builtin_readcyclecounter on RV32 since that produces a loop.

Co-authored-by: Nemanja Ivanovic <nemanja@synopsys.com>
2023-11-17 16:18:44 +01:00
Nemanja Ivanovic
0765f6451f
[RISCV] Use correct register class for Z[df]inx inline asm (#71872)
Allocate a register of the correct register class for inline asm
constraint "r" when used for FP values with -Zfinx/-Zdinx.

---------

Co-authored-by: Nemanja Ivanovic <nemanja@synopsys.com>
2023-11-17 16:17:48 +01:00
Philip Reames
8f81c605f5
[RISCV] Remove custom instruction selection for VFCVT_RM and friends (#72540)
We already have the pseudos for lowering these as MI nodes with
rounding mode operands, and the generic FRM insertion pass. Doing the
insertion later in the backend allows SSA level passes to avoid
reasoning about physical register copies, and happens to produce better
code in practice. The latter is mostly an accident of our insertion
order; we happen to place the frm write after the vsetvli, and it's very
common for a register to be killed at the vsetvli. End result is that we
get slightly better scalar register allocation.

I'm a bit unclear on the history here. I was surprised to find this code
in ISEL lowering at all, but am also surprised once I found it that all
the patterns and pseudos seem to already exist. My best guess is that
maybe we didn't do all the possible cleanup after introducing the
HasRoundMode mechanism?
2023-11-17 07:07:37 -08:00
Simon Pilgrim
2ed15877e7 [X86] Ensure asm comments only print the constant values for the vector load's register width
We were printing the entire Constant, which, if we were loading from a wider constant pool entry, meant that we were confusing the asm comment with upper bits that aren't actually part of the load result.
2023-11-17 14:30:30 +00:00
Jessica Del
b1e039f3b7
[AMDGPU] - Add constant folding for s_quadmask (#72381)
If the input is a constant we can constant fold the `s_quadmask`
intrinsic.
2023-11-17 15:24:23 +01:00
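As a rough illustration of what folding `s_quadmask` on a constant input computes, the sketch below models the 32-bit form under the assumption that result bit i is set whenever any of the four input bits 4i..4i+3 is set. That reading of the instruction and the name `quadmask32` are assumptions of this example; the actual folding code in the backend is not shown in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Model of a 32-bit quad mask: one result bit per group of four input bits.
std::uint32_t quadmask32(std::uint32_t src) {
  std::uint32_t result = 0;
  for (unsigned quad = 0; quad < 8; ++quad) {
    // Extract the four bits belonging to this quad.
    std::uint32_t nibble = (src >> (quad * 4)) & 0xFu;
    if (nibble != 0)
      result |= 1u << quad;
  }
  return result;
}

// A few constant inputs and the values they would fold to under this model.
void checkQuadmaskExamples() {
  assert(quadmask32(0x00000000u) == 0x00u);
  assert(quadmask32(0xFFFFFFFFu) == 0xFFu);
  assert(quadmask32(0x00F0000Fu) == 0x21u);
  assert(quadmask32(0x10000000u) == 0x80u);
}
```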
Simon Pilgrim
0b0440613f [X86] vec_fabs.ll - regenerate checks and add common AVX512 prefixes 2023-11-17 10:31:19 +00:00
Simon Pilgrim
a66085c84c [X86] vec_fabs.ll - sort tests into 128/256/512-bit vector types 2023-11-17 10:31:19 +00:00
Simon Pilgrim
58253dcbcd [X86] getTargetConstantBitsFromNode - bail if we're loading from a constant vector element type larger than the target value size
This can be improved upon by just truncating the constant value, but the crash needs to be addressed first.

Fixes #72539
2023-11-17 10:01:31 +00:00
Philip Reames
233971b475 [RISCV] Fix typo in a test and regen another to reduce test diff 2023-11-16 14:28:16 -08:00
Philip Reames
1aa493f064 [RISCV] Further expand coverage for insert_vector_elt patterns 2023-11-16 14:14:31 -08:00
David Li
ac3779e92e
Enable Custom Lowering for fabs.v8f16 on AVX (#71730)
[X86]: Enable custom lowering for fabs.v8f16 on AVX

Currently, custom lowering of fabs.v8f16 requires AVX512FP16, which is
too restrictive. For v8f16 fabs lowering, no instructions in AVX512FP16
are needed. Without the fix, horribly inefficient code is generated
without AVX512FP16. Note instcombiner generates calls to intrinsics
@llvm.fabs.v8f16 when simplifying AND <8 x half> operations.
2023-11-16 13:47:31 -08:00
Philip Reames
73e963379e [RISCV] Add test coverage for partial buildvecs idioms
Test coverage for an upcoming set of changes
2023-11-16 13:33:12 -08:00
Craig Topper
927f6f1858 [RISCV] Use bset+addi for (not (sll -1, X)).
This is an alternative to #71420 that handles i32 on RV64 safely
by pre-promoting the pattern in DAG combine.
2023-11-16 11:14:53 -08:00
Craig Topper
4eaf986be4 [RISCV] Add test cases for (not (sll -1, X)) for Zbs. NFC
We can use (ADDI (BSET X0, X), -1).
2023-11-16 11:14:53 -08:00
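The identity behind the two commits above, that `(not (sll -1, X))` equals `(ADDI (BSET X0, X), -1)`, can be checked with a small sketch. It models a 64-bit register with shift amounts 0..63; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// (not (sll -1, X)): shift all-ones left by x, then invert.
constexpr std::uint64_t notOfShiftedOnes(unsigned x) {
  return ~(~std::uint64_t{0} << x);
}

// (ADDI (BSET X0, X), -1): set bit x in zero, then subtract one.
constexpr std::uint64_t bsetThenAddi(unsigned x) {
  return (std::uint64_t{1} << x) - 1;
}

static_assert(notOfShiftedOnes(4) == 0xF, "low four bits set");
static_assert(bsetThenAddi(4) == 0xF, "low four bits set");

// Both forms produce a mask of the low x bits for every in-range shift.
void checkAllShiftAmounts() {
  for (unsigned x = 0; x < 64; ++x)
    assert(notOfShiftedOnes(x) == bsetThenAddi(x));
}
```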
Momchil Velikov
4ac5b0da8d Revert "[MachineSink][AArch64] Enable sink-and-fold by default (#72132)"
This reverts commit 13fe0386454d2f4c9bad4e20fc59699d1a49b8cf.

May have broken an LLDB test https://lab.llvm.org/buildbot/#/builders/96/builds/48609
2023-11-16 17:07:39 +00:00
Valery Pykhtin
667ba7f8f3
[AMDGPU] Fix GCNRewritePartialRegUses pass: vector regclass is selected instead of scalar. (#69957)
For the following testcase:

undef %1.sub1:sgpr_96 = COPY undef %0:sgpr_32
%3:vgpr_32 = V_LSHL_ADD_U32_e64 %1.sub1:sgpr_96, ...

GCNRewritePartialRegUses produced:

%4:vgpr_32 = COPY undef %1:sgpr_32
dead %2:vgpr_32 = V_LSHL_ADD_U32_e64 %4, ...

Register class for %4 is incorrect: there should be sgpr_32 instead of
vgpr_32 because the original %1 had scalar regclass. This patch fixes
that.

Note that GCNRewritePartialRegUses pass isn't enabled by default yet.
2023-11-16 16:56:46 +01:00
Jay Foad
be2388c0d9
[AMDGPU] Prefer v_madak_f32 over v_madmk_f32 to reduce vgpr pressure (#72506)
As explained in the comment in SIInstrInfo::FoldImmediate, if we have a
choice between v_madak_f32 and v_madmk_f32 we should choose the former
so that the literal that is not folded into the instruction can be
materialized in an sgpr instead of a vgpr.
2023-11-16 12:50:26 +00:00
Momchil Velikov
13fe038645
[MachineSink][AArch64] Enable sink-and-fold by default (#72132)
Enable the optimisation by default for AArch64 after a compile time
regression fix in e8209b2486d8
2023-11-16 12:12:56 +00:00
Igor Kirillov
63917e1975
[MachineLICM] Allow hoisting loads from invariant address (#70796)
Sometimes, loads can appear in a loop after the LICM pass is executed
the final time. For example, ExpandMemCmp pass creates loads in a loop,
and one of the operands may be an invariant address.
This patch extends the pre-regalloc stage of MachineLICM by allowing it to
hoist invariant loads from loops that don't have any stores or calls
and that allow load reordering.
2023-11-16 11:12:10 +00:00
Matt Devereau
e8dd7ecbc4 Revert "[AArch64][SME2] Add ldr_zt, str_zt builtins and intrinsics (#71795)"
This reverts commit cc1244980b74f45a06e2002a33444ce757b577aa.
2023-11-16 11:01:27 +00:00
Valery Pykhtin
24c3cd1a51
[AMDGPU] Update rewrite-partial-reg-uses tests. NFC. (#72499) 2023-11-16 11:48:39 +01:00
Jessica Del
af05f9ff06
[AMDGPU] - Add constant folding for s_bitreplicate (#72366)
If the input is a constant, we can constant fold the s_bitreplicate
operation.
2023-11-16 09:08:00 +01:00
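For a constant input, the fold of `s_bitreplicate` can be pictured with the sketch below. It assumes the operation doubles each bit of a 32-bit input into a 64-bit result, so that result bits 2i and 2i+1 both copy input bit i. That reading and the name `bitreplicate64` are assumptions of this example, not taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Model of bit replication: each input bit is copied into two adjacent
// result bits.
std::uint64_t bitreplicate64(std::uint32_t src) {
  std::uint64_t result = 0;
  for (unsigned i = 0; i < 32; ++i) {
    if ((src >> i) & 1u)
      result |= std::uint64_t{3} << (2 * i);
  }
  return result;
}

// A few constant inputs and the values they would fold to under this model.
void checkBitreplicateExamples() {
  assert(bitreplicate64(0x00000000u) == 0x0000000000000000ull);
  assert(bitreplicate64(0x00000005u) == 0x0000000000000033ull);
  assert(bitreplicate64(0x80000000u) == 0xC000000000000000ull);
  assert(bitreplicate64(0xFFFFFFFFu) == 0xFFFFFFFFFFFFFFFFull);
}
```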
Christudasan Devadasan
ce7fd498ed
[AMDGPU] RA inserted scalar instructions can be at the BB top (#72140)
We adjust the insertion point at the BB top for spills/copies during RA
to ensure they are placed after the exec restore instructions required
for the divergent control flow execution. This is, however, required
only for the vector operations. The insertions for scalar registers can
still go to the BB top.
2023-11-16 10:30:03 +05:30
LiaoChunyu
71a7108ee9 [RISCV][MC] MC layer support for xcvmem and xcvelw extensions
This commit is part of a patch-set to upstream the 7 vendor specific extensions of CV32E40P.
Several other extensions have been merged.
Spec:
https://github.com/openhwgroup/cv32e40p/blob/master/docs/source/instruction_set_extensions.rst
Contributors: @CharKeaney, @jeremybennett, @lewis-revill, Nandni Jamnadas, @PaoloS, @simoncook, @xmj, @realqhc, @melonedo, @adeelahmad81299

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D158824
2023-11-16 09:46:11 +08:00
Fangrui Song
103811a27a
[RISCV,GISel] Unconditionally use MO_PLT for calls (#72355)
All known linkers handle R_RISCV_CALL and R_RISCV_CALL_PLT in the same
way (GNU ld since
https://sourceware.org/pipermail/binutils/2020-August/112750.html).

MO_CALL is for R_RISCV_CALL, a deprecated relocation type. We don't
migrate away from MO_CALL yet.
For GISel we don't have the output difference concern and should weigh
more on simplicity.
2023-11-15 15:18:47 -08:00
Michael Maitland
dbd884cd3d [RISCV][GISEL] Add vector RegisterBanks and vector support in getRegBankFromRegClass
Vector register banks are created for the various vector
register groupings. getRegBankFromRegClass is implemented to go from
vector TargetRegisterClass to the corresponding vector RegisterBank.
2023-11-15 15:08:29 -08:00
Michael Maitland
725e599637
[RISCV][GISEL] Add support for scalable vector types in lowerReturnVal (#71587)
Scalable vector types from LLVM IR are lowered into physical vector
registers in MIR based on calling convention for return instructions.
2023-11-15 17:30:53 -05:00
Craig Topper
c281a6add5 [RISCV] Add isel pattern for int_riscv_vfmv_s_f with scalar FP constant operand.
Use vmv_s_x instead if the constant will be materialized in a GPR.

This avoids going from GPR to FPR to vector.

We already did this for RISCVISD::VFMV_S_F_VL and probably we should
just turn int_riscv_vfmv_s_f into RISCVISD::VFMV_S_F_VL, but I'd like
to see some improvements to RISCVInsertVSETVLI first.
2023-11-15 10:51:43 -08:00
Craig Topper
084f5c26a4 [RISCV] Add test cases to show missed opportunity to turn vfmv.s.f into vmv.s.x when source is FP constant materialized in GPR.
We end up creating the constant in GPR, move to FPR, then move to vector.
We should go directly from GPR to vector.
2023-11-15 10:51:43 -08:00
Artem Belevich
4f33331317
[NVPTX] split long-running wmma.py test into smaller chunks. (#72331) 2023-11-15 09:55:47 -08:00
Craig Topper
1c033aaac9 [RISCV] Add IsSignExtendingOpW to AMO*_W instructions. (#72349) 2023-11-15 09:39:31 -08:00
Craig Topper
e12677db8b [RISCV] Add test cases showing missed opportunity to remove sext.w after amo*.w. NFC
We should tell RISCVOptWInstrs that these instructions sign extend
their results.
2023-11-15 09:37:09 -08:00
Simon Pilgrim
de41396895 [DAG] foldABSToABD - add support for abs(sub(sign_extend_inreg(),sign_extend_inreg())) patterns
Partial fix for ABDS regressions on D152928
2023-11-15 15:49:30 +00:00
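The arithmetic identity that makes an abs-of-extended-subtract foldable into a signed absolute difference can be illustrated on 8-bit values. This is only a sketch of the identity, assuming the wide result equals the zero-extended narrow ABDS value; it does not model the DAG combine or the sign_extend_inreg matching, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Wide form: sign-extend both operands, subtract, take the absolute value.
std::int32_t absOfExtendedSub(std::int8_t a, std::int8_t b) {
  std::int32_t diff = static_cast<std::int32_t>(a) - static_cast<std::int32_t>(b);
  return diff < 0 ? -diff : diff;
}

// Narrow form: max(a, b) - min(a, b) using signed comparison, with the
// subtraction wrapping in 8 bits and the result read as unsigned.
std::uint8_t abds8(std::int8_t a, std::int8_t b) {
  std::int8_t hi = a > b ? a : b;
  std::int8_t lo = a > b ? b : a;
  return static_cast<std::uint8_t>(static_cast<std::uint8_t>(hi) -
                                   static_cast<std::uint8_t>(lo));
}

// Exhaustively compare the two forms over all pairs of 8-bit inputs.
void checkAllPairs() {
  for (int a = -128; a <= 127; ++a) {
    for (int b = -128; b <= 127; ++b) {
      std::int8_t sa = static_cast<std::int8_t>(a);
      std::int8_t sb = static_cast<std::int8_t>(b);
      assert(absOfExtendedSub(sa, sb) == static_cast<std::int32_t>(abds8(sa, sb)));
    }
  }
}
```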
petar-avramovic
95dd0b04d2
AMDGPU/SILowerI1Copies process phi incomings in specific order (#72375)
When merging lane masks, the value from the block that is always visited first
(PrevReg in buildMergeLaneMasks) needs to exist because we do on-the-fly
constant folding. For PrevReg to exist, the basic block that should contain
the PrevReg definition must be processed first. Sort the incomings such that
incoming values that dominate other incoming values are processed first.

Sorting of phi incomings makes no changes for phis created by SDAG
because SDAG adds phi incomings as it selects basic blocks in reverse
post-order traversal.

This change is required by the upcoming lane mask merging implementation
for GlobalISel that leaves phi incomings as they are in IR.
2023-11-15 16:27:51 +01:00
Tavian Barnes
75cf672b12
[SDAG] Simplify is-power-of-2 codegen (#72275)
When x is not known to be nonzero, ctpop(x) == 1 is expanded to

    x != 0 && (x & (x - 1)) == 0

resulting in codegen like

    leal    -1(%rdi), %eax
    testl   %eax, %edi
    sete    %cl
    testl   %edi, %edi
    setne   %al
    andb    %cl, %al

But another expression that works is

    (x ^ (x - 1)) > x - 1

which has nicer codegen:

    leal    -1(%rdi), %eax
    xorl    %eax, %edi
    cmpl    %eax, %edi
    seta    %al
2023-11-15 22:26:34 +09:00
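The equivalence of the two is-power-of-2 expressions in 75cf672b12 can be checked with a short sketch. It uses 32-bit unsigned arithmetic, matching the unsigned comparison (`seta`) in the codegen above; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Expanded form: x != 0 && (x & (x - 1)) == 0.
constexpr bool isPowerOfTwoAnd(std::uint32_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// Alternative form: (x ^ (x - 1)) > x - 1, compared as unsigned.
constexpr bool isPowerOfTwoXor(std::uint32_t x) {
  return (x ^ (x - 1)) > x - 1;
}

static_assert(!isPowerOfTwoXor(0u), "zero is not a power of two");
static_assert(isPowerOfTwoXor(1u), "1 is a power of two");
static_assert(!isPowerOfTwoXor(6u), "6 is not a power of two");
static_assert(isPowerOfTwoXor(0x80000000u), "2^31 is a power of two");

// Compare both forms for every x in [0, limit).
void checkFormsAgree(std::uint32_t limit) {
  for (std::uint32_t x = 0; x < limit; ++x)
    assert(isPowerOfTwoAnd(x) == isPowerOfTwoXor(x));
}
```

For x = 0, x - 1 wraps to all ones, so the XOR equals x - 1 and the strict comparison is false. That is how the second form rejects zero without a separate test.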