49852 Commits

Author SHA1 Message Date
Daniel Hoekwater
866ae69cfa [AArch64] [BranchRelaxation] Optimize for hot code size in AArch64 branch relaxation
On AArch64, it is safe to let the linker handle relaxation of
unconditional branches; in most cases, the destination is within range,
and the linker doesn't need to do anything. If the linker does insert
fixup code, it clobbers the x16 inter-procedural register, so x16 must
be available across the branch before linking. If x16 isn't available,
but some other register is, we can relax the branch either by spilling
x16 OR using the free register for a manually-inserted indirect branch.

This patch builds on D145211. While that patch is for correctness, this
one is for performance of the common case. As noted in
https://reviews.llvm.org/D145211#4537173, we can trust the linker to
relax cross-section unconditional branches across which x16 is
available.

Programs that use machine function splitting care most about the
performance of hot code at the expense of the performance of cold code,
so we prioritize minimizing hot code size.

Here's a breakdown of the cases:

   Hot -> Cold [x16 is free across the branch]
     Do nothing; let the linker relax the branch.

   Cold -> Hot [x16 is free across the branch]
     Do nothing; let the linker relax the branch.

   Hot -> Cold [x16 used across the branch, but there is a free register]
     Spill x16; let the linker relax the branch.

     Spilling requires fewer instructions than manually inserting an
     indirect branch.

   Cold -> Hot [x16 used across the branch, but there is a free register]
     Manually insert an indirect branch.

     Spilling would require adding a restore block in the hot section.

   Hot -> Cold [No free regs]
     Spill x16; let the linker relax the branch.

   Cold -> Hot [No free regs]
     Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
     Ex:
       [Hot section]
       func.hot:
         ... hot code...
       func.restore:
         ... restore x16 ...
         B func.hot

       [Cold section]
         func.cold:
         ... spill x16 ...
         B func.restore

     Putting the restore block at the end of the function instead of
     just before the destination increases the cost of executing the
     restore, but it avoids putting cold code in the middle of hot code.
     Since the restore is very rarely taken, this is a worthwhile
     tradeoff.
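
The case breakdown above can be read as a small decision table. The following is a minimal sketch in Python, not the pass's actual code; the function name, parameter names and the returned labels are hypothetical and only mirror the wording of the commit message.

```python
def relaxation_strategy(src_is_hot: bool, x16_free: bool, other_reg_free: bool) -> str:
    """Pick a strategy for a cross-section unconditional branch.

    Hypothetical helper that encodes the six cases listed in the commit
    message; the string labels are illustrative only.
    """
    if x16_free:
        # Either direction: the linker may clobber x16, so do nothing.
        return "let-linker-relax"
    if src_is_hot:
        # Hot -> Cold: spilling x16 is the smallest hot-side sequence,
        # whether or not another register happens to be free.
        return "spill-x16"
    if other_reg_free:
        # Cold -> Hot with a free register: avoid a restore block in hot code.
        return "manual-indirect-branch"
    # Cold -> Hot with no free register: spill, restore at the end of the hot function.
    return "spill-x16-restore-at-end-of-hot-function"


for hot in (True, False):
    for x16 in (True, False):
        for other in (True, False):
            print(hot, x16, other, relaxation_strategy(hot, x16, other))
```

When x16 is free the third argument does not matter, which is why the table in the message only lists six distinct cases.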

Differential Revision: https://reviews.llvm.org/D156767
2023-09-06 20:44:40 +00:00
Florian Mayer
42a1d16179 Revert "[AMDGPU] Cope with SelectionDAG::UpdateNodeOperands returning a different SDNode (#65340)"
This reverts commit 11171d81aeafb0c2818f288900423e366a2787fc.

Broke ASAN bot.
2023-09-06 13:16:55 -07:00
Vladislav Dzhidzhoev
c39edd7b53 [AArch64][GlobalISel] Regenerate prelegalizercombiner-shuffle-vector.mir 2023-09-06 18:38:13 +02:00
Craig Topper
bb810d8fa0 [RISCV] Disable machine verifier in gisel-commandline-option.ll. NFC
Hopefully this fixes the expensive checks build.
2023-09-06 09:32:32 -07:00
Simon Pilgrim
e4d0e12099 [DAG] Fold (shl (sext (add_nsw x, c1)), c2) -> (add (shl (sext x), c2), c1 << c2) (REAPPLIED)
Assuming the ADD is nsw then it may be sign-extended to merge with a SHL op in a similar fold to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.

This is most useful for helping to expose address math for X86, but has also touched several aarch64 test cases as well.

Alive2: https://alive2.llvm.org/ce/z/2UpSbJ
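
As a complement to the Alive2 link, the fold can be checked by brute force on small types. This is a sketch under stated assumptions: the 4-bit and 8-bit widths and all helper names are chosen here for illustration and are not part of the patch.

```python
from itertools import product

NARROW, WIDE = 4, 8  # hypothetical widths, small enough to enumerate exhaustively


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def before(x: int, c1: int, c2: int) -> int:
    # (shl (sext (add x, c1)), c2): the add wraps in NARROW, the shift wraps in WIDE.
    return to_signed(to_signed(x + c1, NARROW) << c2, WIDE)


def after(x: int, c1: int, c2: int) -> int:
    # (add (shl (sext x), c2), c1 << c2), computed entirely in WIDE.
    return to_signed((x << c2) + (c1 << c2), WIDE)


signed_range = range(-(1 << (NARROW - 1)), 1 << (NARROW - 1))
nsw_mismatches = 0       # add does not overflow: the fold must hold
wrapping_mismatches = 0  # add overflows: the fold is allowed to differ
for x, c1, c2 in product(signed_range, signed_range, range(WIDE)):
    no_signed_wrap = to_signed(x + c1, NARROW) == x + c1
    if before(x, c1, c2) != after(x, c1, c2):
        if no_signed_wrap:
            nsw_mismatches += 1
        else:
            wrapping_mismatches += 1

print("mismatches with nsw:", nsw_mismatches)
print("mismatches without nsw:", wrapping_mismatches)
```

The second counter illustrates why the nsw flag is required: once the narrow add wraps, the sign-extended value no longer equals x + c1.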

Differential Revision: https://reviews.llvm.org/D159198
2023-09-06 13:19:42 +01:00
Jay Foad
11171d81ae
[AMDGPU] Cope with SelectionDAG::UpdateNodeOperands returning a different SDNode (#65340)
SITargetLowering::adjustWritemask calls SelectionDAG::UpdateNodeOperands
to update an EXTRACT_SUBREG node in-place to refer to a new IMAGE_LOAD
instruction, before we delete the old IMAGE_LOAD instruction. But
UpdateNodeOperands can do CSE on the fly and return a different
EXTRACT_SUBREG node, so the original EXTRACT_SUBREG node would still
exist and would refer to the old deleted IMAGE_LOAD instruction. This
caused errors like:

t31: v3i32,ch = <<Deleted Node!>> # D:1
This target-independent node should have been selected!
UNREACHABLE executed at lib/CodeGen/SelectionDAG/InstrEmitter.cpp:1209!

Fix it by detecting the CSE case and replacing all uses of the original
EXTRACT_SUBREG node with the CSE'd one.
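
The failure mode can be illustrated with a toy model. This is a sketch only: `ToyDAG`, `Node` and their methods are hypothetical stand-ins written for this example and do not reproduce SelectionDAG's real API or data structures.

```python
class Node:
    def __init__(self, opcode, operands=()):
        self.opcode = opcode
        self.operands = list(operands)


class ToyDAG:
    """Toy graph with a CSE map keyed on (opcode, operand identities)."""

    def __init__(self):
        self.nodes = []
        self.cse = {}

    def _key(self, opcode, operands):
        return (opcode, tuple(id(op) for op in operands))

    def get_node(self, opcode, operands=()):
        key = self._key(opcode, operands)
        if key not in self.cse:
            node = Node(opcode, operands)
            self.cse[key] = node
            self.nodes.append(node)
        return self.cse[key]

    def update_operands(self, node, operands):
        """Update in place, unless an identical node already exists (CSE hit)."""
        key = self._key(node.opcode, operands)
        existing = self.cse.get(key)
        if existing is not None and existing is not node:
            return existing  # `node` is left untouched
        self.cse.pop(self._key(node.opcode, node.operands), None)
        node.operands = list(operands)
        self.cse[key] = node
        return node

    def replace_all_uses(self, old, new):
        # The toy does not re-key users in the CSE map; enough for this demo.
        for node in self.nodes:
            node.operands = [new if op is old else op for op in node.operands]


dag = ToyDAG()
old_load = dag.get_node("OLD_IMAGE_LOAD")
new_load = dag.get_node("NEW_IMAGE_LOAD")
extract = dag.get_node("EXTRACT_SUBREG", [old_load])
twin = dag.get_node("EXTRACT_SUBREG", [new_load])  # identical node already present
user = dag.get_node("USER", [extract])

result = dag.update_operands(extract, [new_load])
# Without the fix, `extract` still points at the load that is about to be deleted.
stale = result is not extract and extract.operands[0] is old_load
if result is not extract:
    dag.replace_all_uses(extract, result)  # the fix described in the message
```

After the replacement step, `user` refers to the CSE'd node, so deleting the old load no longer leaves a dangling reference in this model.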
2023-09-06 12:51:44 +01:00
Luke Lau
74f985b793
[RISCV] Remove -riscv-v-vector-bits-min in tests. NFC (#65404)
V implies Zvl128b, but a lot of the fixed vector tests also redundantly
specify -riscv-v-vector-bits-min=128. This patch removes them where
there isn't another minimum vlen being tested for, and for cases where
Zve* is being used Zvl128b was added to maintain the old test diff (and
because an awkward vlen probably isn't interesting to test for). Other
places where -riscv-v-vector-bits-min was being used were replaced with
Zvl.
2023-09-06 10:43:41 +01:00
Dmitri Gribenko
97bf104d97 Revert "[DAG] Fold (shl (sext (add_nsw x, c1)), c2) -> (add (shl (sext x), c2), c1 << c2)"
This reverts commit b027ce0ab93060bc6cb79d5402d21520e8b93fb7.

This commit breaks Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll.
2023-09-06 11:28:55 +02:00
Simon Pilgrim
b027ce0ab9 [DAG] Fold (shl (sext (add_nsw x, c1)), c2) -> (add (shl (sext x), c2), c1 << c2)
Assuming the ADD is nsw then it may be sign-extended to merge with a SHL op in a similar fold to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.

This is most useful for helping to expose address math for X86, but has also touched several aarch64 test cases as well.

Alive2: https://alive2.llvm.org/ce/z/2UpSbJ

Differential Revision: https://reviews.llvm.org/D159198
2023-09-06 10:06:21 +01:00
Kito Cheng
af9b25f9db [RISCV] Optimize floating point scalar move and splat
D158086 prevented every floating point scalar move and splat from fusing
its vsetvli with one that uses a different SEW. This patch relaxes that
constraint as far as possible by introducing a new SEW demand type,
SEWGreaterThanOrEqualAndLessThan64, which allows fusing with a larger
SEW but not with SEW=64.
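
Read literally, the new demand type amounts to a range check on the candidate SEW. The predicate below is a sketch of that reading only; the function name and parameters are hypothetical, and the real check in the vsetvli insertion pass may be structured differently.

```python
def sew_compatible(current_sew: int, new_sew: int) -> bool:
    """Sketch of SEWGreaterThanOrEqualAndLessThan64 as the message describes it.

    Assumption: the candidate SEW may be equal to or larger than the one the
    instruction was written for, as long as it stays below 64.
    """
    return current_sew <= new_sew < 64


for current, new in [(16, 16), (16, 32), (32, 64), (32, 16)]:
    print(current, new, sew_compatible(current, new))
```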

Reviewed By: rogfer01

Differential Revision: https://reviews.llvm.org/D158177
2023-09-06 16:39:30 +08:00
laichunfeng
71b5f57f0d [RISCV] Adjust first sp size to use c.addi16sp.
addi sp, sp, 512 may be used to restore sp in the epilogue when the
stack size is larger than 2047 (2^11 - 1). However, it cannot be
compressed using the C extension, whereas addi sp, sp, 496 can. So try
to use 496 as the first sp adjustment amount if the function doesn't
need extra instructions after the adjustment.
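
The choice of 496 follows from the immediate ranges of the two encodings: addi takes a 12-bit signed immediate, while c.addi16sp takes a nonzero multiple of 16 in [-512, 496]. A minimal sketch of those range checks, with hypothetical helper names:

```python
def fits_addi(imm: int) -> bool:
    """12-bit signed immediate of the base addi instruction."""
    return -2048 <= imm <= 2047


def fits_c_addi16sp(imm: int) -> bool:
    """c.addi16sp: nonzero immediate, scaled by 16, in [-512, 496]."""
    return imm != 0 and imm % 16 == 0 and -512 <= imm <= 496


for imm in (512, 496, -512, 8):
    print(imm, fits_addi(imm), fits_c_addi16sp(imm))
```

The range is asymmetric: -512 is encodable in the compressed form while +512 is not, which is why only the epilogue's positive adjustment needs the smaller amount.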

Reviewed By: wangpc

Differential Revision: https://reviews.llvm.org/D159431
2023-09-06 14:26:52 +08:00
Ting Wang
71be020dda [SelectionDAG][PowerPC] Memset reuse vector element for tail store
On PPC there are instructions that store an element from a vector (e.g.
stxsdx/stxsiwx), and these instructions can be leveraged to avoid a
separate tail constant in memset and constant splat array initialization.

This patch tries to explore these opportunities.

Reviewed By: shchenz

Differential Revision: https://reviews.llvm.org/D138883
2023-09-06 01:52:38 -04:00
Pravin Jagtap
b230472f22
[AMDGPU] Extend v2i16 & v2f16 support for llvm.amdgcn.update.dpp intr (#65318)
Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2023-09-06 10:20:34 +05:30
Craig Topper
2a7b8ab07c
[RISCV] Use add.uw for (or (and X, 0xFFFFFFFF), Y) if Y has zeroes in the lower 32 bits. (#65402) 2023-09-05 21:05:53 -07:00
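
The add.uw fold in the entry above rests on the fact that OR and ADD agree when the two operands share no set bits. A small sketch of that identity on 64-bit values follows; the helper names are hypothetical, and `add_uw` models the Zba instruction as rs2 plus the zero-extended low 32 bits of rs1.

```python
MASK64 = (1 << 64) - 1
LOW32 = 0xFFFFFFFF


def add_uw(rs1: int, rs2: int) -> int:
    """Model of add.uw: rs2 + zext(rs1[31:0]), modulo 2**64."""
    return ((rs1 & LOW32) + rs2) & MASK64


def or_of_masked(x: int, y: int) -> int:
    """The pattern being matched: (or (and X, 0xFFFFFFFF), Y)."""
    return ((x & LOW32) | y) & MASK64


x = 0x1234_5678_9ABC_DEF0
y_ok = 0xFFFF_0000_0000_0000   # low 32 bits are zero: no overlap with the masked X
y_bad = 0x0000_0000_0000_0010  # overlaps bit 4 of the masked X

print(hex(or_of_masked(x, y_ok)), hex(add_uw(x, y_ok)))
print(hex(or_of_masked(x, y_bad)), hex(add_uw(x, y_bad)))
```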
Amara Emerson
6c31f20fee
[GlobalISel] Fold fmul x, 1.0 -> x (#65379) 2023-09-06 03:14:16 +08:00
Amy Kwan
f0b2f69541 [AIX][TLS] Generate .extern and .ref references to __tls_get_addr for local-exec accesses.
Compiling with TLS variables requires -pthread, but if the user omits this
option, the compiler will not show any obvious indication during compilation
that -pthread is needed for programs using TLS variables. Instead, the user will
experience a segmentation fault when running programs with TLS variables in them
and without specifying -pthread.

This patch aims to generate .extern/.ref references to __tls_get_addr[DS] for
local-exec accesses, in order to trigger an error from the linker to indicate
that there is an undefined symbol to __tls_get_addr. Doing so will remind the
user to compile/link with -pthread.

Differential Revision: https://reviews.llvm.org/D151335
2023-09-05 12:15:14 -05:00
Simon Pilgrim
e086e0aeef [X86] Add test coverage for new smulo folds added in D159406
Pulled from the InstCombine with_overflow.ll tests
2023-09-05 17:43:42 +01:00
Philip Reames
de34d39b66 [RISCV] Cap build vector cost to avoid quadratic cost at high LMULs
Each vslide1down operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique inserts, each with a cost linear in LMUL, the overall cost is O(VL*LMUL). Since VL is a linear function of LMUL, this means the current lowering is quadratic in both LMUL and VL. To avoid the degenerate case, fall back to the stack if the cost is more than a fixed (linear) threshold.

For context, here's the sifive-x280 llvm-mca results for the current lowering and stack based lowering for each LMUL (using e64). Assumes code was compiled for V (i.e. zvl128b).
  buildvector_m1_via_stack.mca:Total Cycles: 1904
  buildvector_m2_via_stack.mca:Total Cycles: 2104
  buildvector_m4_via_stack.mca:Total Cycles: 2504
  buildvector_m8_via_stack.mca:Total Cycles: 3304
  buildvector_m1_via_vslide1down.mca:Total Cycles:  804
  buildvector_m2_via_vslide1down.mca:Total Cycles:  1604
  buildvector_m4_via_vslide1down.mca:Total Cycles:  6400
  buildvector_m8_via_vslide1down.mca:Total Cycles: 25599
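
The growth pattern in these numbers can be made explicit by computing the cycle ratio for each doubling of LMUL. The sketch below uses only the figures listed above; the dictionary layout and function name are choices made for this example.

```python
# Total cycles from the llvm-mca results above, keyed by LMUL.
cycles = {
    "stack": {1: 1904, 2: 2104, 4: 2504, 8: 3304},
    "vslide1down": {1: 804, 2: 1604, 4: 6400, 8: 25599},
}


def growth(series):
    """Ratio of cycles between each LMUL and the next larger one."""
    lmuls = sorted(series)
    return [series[b] / series[a] for a, b in zip(lmuls, lmuls[1:])]


for name, series in cycles.items():
    print(name, [round(ratio, 2) for ratio in growth(series)])
```

In the m2 to m4 and m4 to m8 steps the vslide1down cost roughly quadruples per doubling, which is what a quadratic cost predicts, while the stack-based numbers grow by a modest additive amount.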

There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, stack based seems to cost better on all LMULs, so we can just go with the simpler scheme.

Arguably, this patch is fixing a regression introduced with my D149667, as before that change we'd always fall back to the stack, and thus didn't have the non-linearity.

Differential Revision: https://reviews.llvm.org/D159332
2023-09-05 09:03:26 -07:00
Craig Topper
fa31ce5320
[RISCV][GISel] Add gisel-commandline-option.ll similar to AArch64. (#65299)
This allows us to see the pass pipeline for GlobalISel.
2023-09-05 09:01:50 -07:00
Amara Emerson
08e04209d8
[GlobalISel] Commute G_FMUL and G_FADD constant LHS to RHS. (#65298) 2023-09-05 23:48:34 +08:00
Luke Lau
2fc6fadeaf [RISCV] Fix typo in test title. NFC 2023-09-05 15:57:18 +01:00
Vladislav Dzhidzhoev
13b7629a58 [GlobalISel][AArch64] Combine unmerge(G_EXT v, undef) to unmerge(v).
When having <N x t> d1, unused = unmerge(G_EXT <2*N x t> v1, undef, N),
it is possible to express it just as unused, d1 = unmerge v1.

It is useful for tackling regressions in arm64-vcvt_f.ll, introduced in
https://reviews.llvm.org/D144670.
2023-09-05 16:14:44 +02:00
Vladislav Dzhidzhoev
7eeeeb0cc9 Revert "[GlobalISel][AArch64] Combine unmerge(G_EXT v, undef) to unmerge(v)."
This reverts commit 6b37a65264bb4e7d400d5283a65f9e8e1575f2d7.
Accidentally pushed before squashing.
2023-09-05 16:13:27 +02:00
Vladislav Dzhidzhoev
0e826f0e6d Refactored, added MIR test. 2023-09-05 16:00:48 +02:00
Vladislav Dzhidzhoev
6b37a65264 [GlobalISel][AArch64] Combine unmerge(G_EXT v, undef) to unmerge(v).
When having <N x t> d1, unused = unmerge(G_EXT <2*N x t> v1, undef, N),
it is possible to express it just as unused, d1 = unmerge v1.

It is useful for tackling regressions in arm64-vcvt_f.ll, introduced in
https://reviews.llvm.org/D144670.
2023-09-05 16:00:48 +02:00
Jingu Kang
67fc0d3d39 [AArch64] Remove copy instruction between uaddlv and dup
If there are copy instructions between uaddlv and dup for transfer from gpr to
fpr, try to remove them with duplane.

Differential Revision: https://reviews.llvm.org/D159267
2023-09-05 14:41:28 +01:00
David Sherwood
50598f0ff4 [DAGCombiner][SVE] Add support for illegal extending masked loads
In some cases where the same mask is used for multiple
extending masked loads it can be more efficient to combine
the zero- or sign-extend into the load even if it's not a
legal or custom operation. This leads to splitting up the
extending load into smaller parts, which also requires
splitting the mask. For SVE at least this improves the
performance of the SPEC benchmark x264 slightly on
neoverse-v1 (~0.3%), and at least one other benchmark
improves by around 30%. The uplift for SVE seems due to
removing the dependencies (vector unpacks) introduced
between the loads and the vector operations, since this
should increase the level of parallelism.

See tests:

  CodeGen/AArch64/sve-masked-ldst-sext.ll
  CodeGen/AArch64/sve-masked-ldst-zext.ll

https://reviews.llvm.org/D159191
2023-09-05 10:41:21 +00:00
David Sherwood
64094e3e6d [DAGCombiner] Pre-commit tests for D159191
I've added some missing tests for the following cases:

1. Zero- and sign-extends from unpacked vector types to wide,
   illegal types. For example,
   %aext = zext <vscale x 4 x i8> %a to <vscale x 4 x i64>
2. Normal loads combined with 1
3. Masked loads combined with 1

Differential Revision: https://reviews.llvm.org/D159192
2023-09-05 10:41:21 +00:00
Amara Emerson
12e4921709
[GlobalISel] Constant fold sitofp/uitofp of 0. (#65307) 2023-09-05 17:33:57 +08:00
pvanhout
844c0da777 [TableGen][GlobalISel] Add MIR Pattern Builtins
Adds a new feature to MIR patterns: builtin instructions.
They offer some additional capabilities that currently cannot be expressed without falling back to C++ code.
There are two builtins added with this patch, but more can be added later as new needs arise:
 - GIReplaceReg
 - GIEraseRoot

Depends on D158714, D158713

Reviewed By: arsenm, aemerson

Differential Revision: https://reviews.llvm.org/D158975
2023-09-05 08:19:07 +02:00
Qiu Chaofan
082c5d7f63 [PowerPC] Implement builtin for mffsl
mffsl is available since ISA 3.0. The builtin is named with ppc prefix
to follow our convention. For targets earlier than power9, GCC generates
extra code to support the functionality, while this patch does not
implement such behavior.

Reviewed By: nemanjai, tuliom

Differential Revision: https://reviews.llvm.org/D158065
2023-09-05 11:22:09 +08:00
Nicolai Hähnle
62790a8d4a AMDGPU: Fix test from previous commit 2023-09-05 00:31:49 +02:00
Nicolai Hähnle
f5fb6ad2e5 AMDGPU: Precommit a test file
Demonstrates bad scheduling for private load/store vs. buffer
intrinsics.
2023-09-05 00:17:46 +02:00
Amara Emerson
91746d15d2
[GlobalISel] Fix G_PTR_ADD immediate chain combine using the wrong im… (#65271) 2023-09-05 08:06:40 +08:00
Jay Foad
71ca53b6cf
[GlobalISel] Lower G_SHUFFLE_VECTOR with scalar result (#65275) 2023-09-04 13:32:43 -04:00
Simon Pilgrim
e6971cbc06 [X86] combine-mulo.ll - add common CHECK prefix for SSE/AVX test runs 2023-09-04 16:42:48 +01:00
Amara Emerson
f51b7992c9 [GlobalISel] Precommit a ptradd combine test. 2023-09-04 08:27:20 -07:00
Vladislav Dzhidzhoev
a15144f2ba [AArch64][GlobalISel] Lower G_EXTRACT_VECTOR_ELT with variable indices
G_EXTRACT_VECTOR_ELT instructions with non-constant indices are not
selected, so they need to be lowered.

Fixes https://github.com/llvm/llvm-project/issues/65049.

Reviewed By: Peter

Differential Revision: https://reviews.llvm.org/D159096
2023-09-04 16:19:16 +02:00
sdesmalen-arm
dbf9b93f25
[AArch64][SME] Disable tail-call optimization for __arm_locally_streaming functions. (#65258)
When calling a function which requires no streaming-mode change from an
__arm_locally_streaming function, LLVM would otherwise emit:

  // function prologue
  smstart
  b streaming_compatible_function   // tail call
  // never an smstop
2023-09-04 15:11:22 +01:00
John Brawn
fae3f9ec4f [ARM] Fix prologue/epilogue for pacbti-m leaf functions
R12 is callee-saved in functions with pacbti-m enabled, but this is
done in assignCalleeSavedSpillSlots, meaning that in
determineCalleeSaves we have to manually set CanEliminateFrame.

This fixes a bug where in leaf functions with no other callee-saved
registers the aut instruction wouldn't be emitted and stack offsets
of arguments passed on the stack would be incorrect.

Differential Revision: https://reviews.llvm.org/D157865
2023-09-04 13:46:01 +01:00
Sander de Smalen
702c3f56d3 [SME] Don't scavenge a spillslot in callee-save area in presence of streaming-mode changes.
If no frame-pointer is available and the compiler has scavenged a
spill-slot in the callee-save area, the compiler may be forced to emit an
'addvl' inside the streaming-mode-changing call sequence when it needs to
fill (reload) an FP register being passed to the call.

We can avoid this entirely by disabling stack-slot scavenging when there
are streaming-mode-changing call-sequences in the function.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D159196
2023-09-04 10:14:44 +00:00
Luke Lau
6098d7d5f6 [RISCV] Lower shuffles as rotates without zvbb
Now that the codegen for the expanded ISD::ROTL sequence has been improved,
it's probably profitable to lower a shuffle that's a rotate to the
vsll+vsrl+vor sequence to avoid a vrgather where possible, even if we don't
have the vror instruction.

This patch relaxes the restriction on ISD::ROTL being legal in
lowerVECTOR_SHUFFLEAsRotate. It also attempts to do the lowering twice: Once
if zvbb is enabled before any of the interleave/deinterleave/vmerge lowerings,
and a second time unconditionally just before it falls back to the vrgather.
This way it doesn't interfere with any of the above patterns that may be more
profitable than the expanded ISD::ROTL sequence.
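
The idea of treating a shuffle as a rotate can be sketched on plain integers: swapping the two bytes of each 16-bit lane is the same as rotating that lane left by 8, and a rotate can be built from two shifts and an OR. The helpers below are hypothetical and model the arithmetic only, not the lowering code.

```python
def rotl(value: int, amount: int, bits: int) -> int:
    """Rotate left built from shift-left, shift-right and or."""
    amount %= bits
    mask = (1 << bits) - 1
    return ((value << amount) | (value >> (bits - amount))) & mask


def shuffle(vec, mask):
    """Pick elements of vec according to a shuffle mask."""
    return [vec[i] for i in mask]


def as_i16_lanes(byte_vec):
    """Reinterpret little-endian byte pairs as 16-bit lanes."""
    return [byte_vec[i] | (byte_vec[i + 1] << 8) for i in range(0, len(byte_vec), 2)]


bytes_in = [0x11, 0x22, 0x33, 0x44]
shuffled = shuffle(bytes_in, [1, 0, 3, 2])  # swap bytes within each pair
rotated = [rotl(lane, 8, 16) for lane in as_i16_lanes(bytes_in)]
print(as_i16_lanes(shuffled) == rotated)
```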

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D159353
2023-09-04 09:35:12 +01:00
Amara Emerson
0065640f40 [GlobalISel] Look through a G_PTR_ADD's def register instead of its source operand's
uses when looking for load/store users. This was a simple logic bug during translation
of the equivalent function in SelectionDAG:
```
    for (SDNode *Node : N->uses()) {
      if (auto *LoadStore = dyn_cast<MemSDNode>(Node)) {
```
2023-09-04 00:28:57 -07:00
Amara Emerson
59cbee4599 [GlobalISel] Fix an incorrect ptradd reassoc test. NFC.
The lookthrough int<->ptr cast tests and code were both checking the wrong
register uses. This change fixes and precommits the test to prepare for
the code fix.
2023-09-04 00:28:56 -07:00
Amara Emerson
69d8ca21af [GlobalISel] Regenerate ptradd reassociation tests checks. 2023-09-04 00:03:38 -07:00
Matt Arsenault
65b40f273f RegAlloc: Rename MLRegalloc* files to use consistent capitalization
The other regalloc related files use RegAlloc, not Regalloc.
2023-09-03 09:00:27 -04:00
Simon Pilgrim
d9ffd3219e [X86] combineCMP - attempt to simplify KSHIFTR mask element extractions when just comparing against zero (REAPPLIED)
We can just bitcast the pre-shifted mask as an integer and use TEST/BT directly.

Reapplied with fix for 239ab16ec121 which didn't set the comparison type correctly
2023-09-02 17:45:17 +01:00
Simon Pilgrim
600b4634ac [X86] Add test to check that an extracted bool element comparison is correctly extended when the bool vector is bitcast instead
Thanks to @zequanwu for the reduced test case where 239ab16ec121 failed to correctly cast a compare-with-zero to the correct integer type
2023-09-02 17:34:12 +01:00
Matt Arsenault
1f52060000 AMDGPU: Use poison instead of undef in module lds pass 2023-09-02 11:33:26 -04:00
Xiang Li
c21cd168bb
[DirectX backend] avoid generate redundant bitcast in DXILPrepareModule (#65163)
When emitting a NoOp bitcast for a GEP pointer operand, SourceElementType
should be used instead of ResultElementType.

**Behavior Before Change**
A redundant bitcast like
`bitcast ptr addrspace(3) @gs to ptr addrspace(3)`
will be generated for llvm/test/CodeGen/DirectX/typed_ptr.ll.

**Behavior After Change**
  No bitcast will be generated.

Fixes https://github.com/llvm/llvm-project/issues/65183
2023-09-01 20:08:39 -04:00