59157 Commits

Author SHA1 Message Date
Changpeng Fang
70e78be7dc
AMDGPU: Custom lower fptrunc vectors for f32 -> f16 (#141883)
The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the
current implementation of vector fptrunc lowering fully scalarizes the
vectors, and the scalar conversions may not always be combined to generate
the packed one.
We made v2f32 -> v2f16 legal in
https://github.com/llvm/llvm-project/pull/139956. This work is an
extension to handle wider vectors. Instead of full scalarization, we
split the vector into packs (v2f32 -> v2f16) to ensure the packed
conversion can always be generated.
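
As a rough illustration of the splitting idea only (not the actual lowering code), the sketch below groups the elements of a wider vector into consecutive two-element packs, each of which would correspond to one v2f32 -> v2f16 conversion. The function name `split_into_packs` is hypothetical, and how odd-length vectors are handled in the real lowering is not covered here.

```python
def split_into_packs(elements, pack_width=2):
    """Group a vector's elements into consecutive packs of `pack_width`.

    Hypothetical helper for illustration: each returned pack stands for
    one packed conversion (v2f32 -> v2f16) instead of two scalar ones.
    """
    return [elements[i:i + pack_width]
            for i in range(0, len(elements), pack_width)]


# A v4f32 source becomes two packs, so two packed conversions.
packs = split_into_packs(["x0", "x1", "x2", "x3"])
print(packs)
```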
2025-06-06 15:15:24 -07:00
LU-JOHN
549ce80f27
[AMDGPU][NFC] Add test for 64-bit lshr with shifts >=32 (#138281)
Record current results for 64-bit lshr with shifts >=32.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-06-06 17:04:54 -04:00
LU-JOHN
3bbb49610e
[AMDGPU][NFC] Add tests for 64-bit ashr with shifts >= 32 (#142463)
Record current results for 64-bit ashr with shifts >=32.

Signed-off-by: John Lu <John.Lu@amd.com>
2025-06-06 17:04:45 -04:00
David Green
b0f53d95c1
[AArch64] Add SUBS(CSEL) fold from brcond. (#142103)
This folds away subs(csel(1, 0, cc)) from brcond, that can be produced
in certain places from compares that are not already subs (like adc/sbc
generated from i128 add_with_overflow intrinsics).
2025-06-06 18:00:00 +01:00
David Green
645c0d509c
[AArch64][GlobalISel] Ensure we have a insert-subreg v4i32 GPR pattern (#142724)
This is the GISel equivalent of scalar_to_vector, making sure that when
we insert into undef we use a fmov that avoids the artificial dependency
on the previous register. This adds v2i32 and v2i64 patterns too for
similar reasons.
2025-06-06 17:44:33 +01:00
Guy David
2c0a2261b1
[AArch64] Spare N2I roundtrip when splatting float comparison (#141806)
Transform `select_cc t1, t2, -1, 0` for floats into a vector comparison
which generates a mask, which is later on combined with potential
vectorized DUPs.
2025-06-06 19:07:12 +03:00
David Green
56ebe64ce6
[AArch64] Enable aggressivelyPreferBuildVectorSources (#142729)
This helps to remove some inefficient buildvector lowering by converting
extract_vector_elt(buildvector) to the original source.
2025-06-06 17:03:10 +01:00
Simon Pilgrim
399865cbf0
[X86] combineConcatVectorOps - concat per-lane v2f64/v4f64 shuffles into vXf64 vshufpd (#143017)
We can always concatenate v2f64/v4f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands (or it's a unary shuffle).

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation/length changing as well as combineConcatVectorOps.
2025-06-06 16:41:40 +01:00
Luke Lau
a029ece7b0
[RISCV] Fix coalescing vsetvlis when AVL and vl registers are the same (#141941)
With EVL tail folding we can end up with vsetvlis where the output vl
and the input AVL are the same register. When we try to coalesce it we
crashed because we tried to move the def's live interval before the
kill's live interval, e.g. in this example:

    (vn0 def)
    dead $x0 = PseudoVSETIVLI 1, 192, implicit-def $vl, implicit-def $vtype
    renamable $v9 = COPY killed renamable $v8
    (vn1 def) %23:gprnox0 = PseudoVSETVLI killed (vn0) %23:gprnox0, 197, implicit-def $vl, implicit-def $vtype

We would try to move the vn1 def VNInfo up to the previous VSETVLI, in
the middle of vn0's segment.

However separately, we were also assuming that the vl would only have
one definition and thus were just taking the VNInfo from beginIndex(),
so we ended up with a backwards segment and got the error "Cannot create
empty or backwards segment".

This fixes these two issues, the first one by moving the AVL operand +
live interval up first, and the second by taking the VNInfo from
NextMI's slot index.

Fixes #141907
2025-06-06 17:34:27 +02:00
Durgadoss R
c4012bb5de
[NVPTX] Add pm_event intrinsics (#141278)
This patch adds the pm_event.mask intrinsic and its
clang-builtin.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2025-06-06 19:39:33 +05:30
Feng Zou
efc70787b5
[X86][APX] Prevent from emitting push2/pop2 if stack alignment<16B (#143076)
push2/pop2 require 16-byte stack alignment. If the stack alignment is
less than that, push2/pop2 should not be emitted, because a general
protection exception is triggered if the data being pushed/popped by
push2/pop2 is not 16-byte aligned on the stack.
2025-06-06 21:10:21 +08:00
Kerry McLaughlin
df48dfa0ae
[AArch64] Add custom lowering of nxv32i1 get.active.lane.mask nodes (#141969)
performActiveLaneMaskCombine already tries to combine a single
get.active.lane.mask where the low and high halves of the result are
extracted into a single whilelo which operates on a predicate pair.

If the get.active.lane.mask node requires splitting, multiple nodes are
created with saturating adds to increment the starting index. We cannot
combine these into a single whilelo_x2 at this point unless we know
the add will not overflow.

This patch adds custom lowering for the node if the return type is
nxv32i1, as this can be replaced with a whilelo_x2 using legal types.
Anything wider than nxv32i1 will still require splitting first.
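
The role of the saturating add when splitting can be sketched with a small model. This is an illustration under assumptions, not the SelectionDAG code: lanes are modelled as a fixed-length Python list rather than a scalable vector, the index type is assumed to be 64 bits wide, and the helper names are hypothetical. Lane `i` is taken to be active when `base + i < n` without wrap-around.

```python
def uadd_sat(a, b, bits=64):
    """Unsigned saturating add: clamp at the maximum representable value."""
    return min(a + b, (1 << bits) - 1)


def active_lane_mask(base, n, lanes):
    """Model of get.active.lane.mask: lane i is active iff base + i < n.

    Python integers do not wrap, so base + i is computed exactly.
    """
    return [base + i < n for i in range(lanes)]


def split_active_lane_mask(base, n, lanes, bits=64):
    """Split the mask into two halves; the high half starts at a
    saturating-incremented index so it cannot wrap around to a small value."""
    half = lanes // 2
    lo = active_lane_mask(base, n, half)
    hi = active_lane_mask(uadd_sat(base, half, bits), n, half)
    return lo + hi


# Near the top of the index range the incremented start index saturates,
# and the split mask still matches the unsplit one.
base, n = (1 << 64) - 3, (1 << 64) - 1
print(split_active_lane_mask(base, n, 8) == active_lane_mask(base, n, 8))
```

With a wrapping add instead, the high half in this example would start at index 1 and wrongly report active lanes, which is why the overflow behaviour matters before merging the halves into one instruction.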
2025-06-06 13:35:34 +01:00
Simon Pilgrim
4e676a1317
[X86] Fold (add X, (srl Y, 7)) -> (sub X, (ashr Y, 7)) on vXi8 vectors (#143106)
Undo the vectorcombine canonicalisation as SSE has awful vXi8 shift support, but can easily splat the MSB using the PCMPGTB(0,x) trick.
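
The equivalence behind this fold can be checked exhaustively for a single 8-bit lane. The sketch below is only a model of the arithmetic with hypothetical helper names, not the DAG combine itself: a logical shift right by 7 yields 0 or 1, an arithmetic shift right by 7 yields 0 or -1 (0xFF), and subtracting -1 is the same as adding 1 modulo 256.

```python
MASK8 = 0xFF


def srl7(y):
    """Logical shift right by 7 on an 8-bit value: result is 0 or 1."""
    return (y & MASK8) >> 7


def ashr7(y):
    """Arithmetic shift right by 7 on an 8-bit value: splats the sign bit,
    giving 0x00 or 0xFF (that is, 0 or -1)."""
    return MASK8 if (y & 0x80) else 0


def add_srl(x, y):
    return (x + srl7(y)) & MASK8


def sub_ashr(x, y):
    return (x - ashr7(y)) & MASK8


def fold_holds_for_all_i8():
    """Check (add X, (srl Y, 7)) == (sub X, (ashr Y, 7)) for every i8 pair."""
    return all(add_srl(x, y) == sub_ashr(x, y)
               for x in range(256) for y in range(256))


print(fold_holds_for_all_i8())
```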

Fixes #130549
2025-06-06 13:33:00 +01:00
Benjamin Maxwell
c95bc41562
[AArch64][SDAG] Fix selection of extend of v1if16 SETCC (#140274)
There is a DAG combine that folds:

```
t1: v1i1 = setcc x:v1f16, y:v1f16, setogt:ch
t2: v1i64 = zero_extend t1
```

->

```
t1: v1i16 = setcc x:v1f16, y:v1f16, setogt:ch
t2: v1i64 = any_extend t1
```

This creates an issue on AArch64 when attempting to widen the result to
`v4i16`. The operand types (`v1f16`) are set to be scalarized, so the
"by hand" widening with `DAG.WidenVector` is used for them, however,
this only widens to the next power-of-2, so returns `v2f16`, which does
not match the result VF. The fix is to manually construct the widened
inputs using `INSERT_SUBVECTOR`.

Fixes #136540
2025-06-06 11:20:52 +01:00
Simon Pilgrim
f59742c1ea
[X86] getIntImmCostInst - recognise i64 ICMP EQ/NE special cases (#142812)
If the lower 32 bits of an i64 value are known to be zero, then icmp
lowering will shift+truncate down to an i32, allowing the immediate to be
embedded.
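
The arithmetic that makes this narrowing valid can be sketched as follows. This is a model under assumptions, not the cost-model or lowering code: it assumes both the value and the immediate have their low 32 bits equal to zero, and the function names are hypothetical.

```python
M64 = (1 << 64) - 1
M32 = (1 << 32) - 1


def icmp_eq_i64(x, imm):
    """Full-width 64-bit equality compare."""
    return (x & M64) == (imm & M64)


def icmp_eq_narrowed(x, imm):
    """Shift right by 32 and truncate to 32 bits, then compare.

    Assumption: the low 32 bits of both x and imm are zero, so no
    information is lost and the immediate fits in 32 bits.
    """
    return ((x >> 32) & M32) == ((imm >> 32) & M32)


x = 0x1234_5678_0000_0000
print(icmp_eq_i64(x, x) == icmp_eq_narrowed(x, x))
```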

There's a lot more that could be done here to match icmp lowering, but
this PR just focuses on known regressions.

Fixes #142513
Fixes #62145
2025-06-06 10:21:40 +01:00
Jay Foad
5f33b9d286
[MIRParser] Report register class errors in a deterministic order (#142928)
2025-06-06 10:03:34 +01:00
Guy David
4d4b7cc69e
[AArch64] Skip storing of stack arguments when lowering tail calls (#126735)
This issue starts in the selection DAG and causes the backend to emit
the following for a trivial tail call:
```
ldr w8, [sp]
str w8, [sp]
b func
```

I'm not too sure that checking for immutability of a specific stack
object is a good enough guarantee, because as soon as a tail call is
done lowering, `setHasTailCall()` is called and in that case perhaps a
pass is allowed to change the value of the object in memory?

This can be extended to the ARM backend as well.
Removed the `tailcall` keyword from a few other test assets, I'm
assuming their original intent was left intact.
2025-06-06 11:26:24 +03:00
hev
182c1c268f
[LoongArch][NFC] Pre-commit for converting vector mask to vXi1 using [X]VMSKLTZ (#142977)
2025-06-06 16:26:17 +08:00
hev
470f456567
[LoongArch] Add codegen support for atomic-ops on LA32 (#141557)
This patch adds codegen support for atomic operations `cmpxchg`, `max`,
`min`, `umax` and `umin` on the LA32 target.
2025-06-06 16:00:59 +08:00
yingopq
bbe5ceb22f
[Mips] When emit instruction, ignore JUMP_TABLE_DEBUG_INFO (#139830)
When the triple targets Windows, SelectionDAGLegalize generates
ISD::JUMP_TABLE_DEBUG_INFO while legalizing br_jt.
When Mips then runs emitInstruction, it treats JUMP_TABLE_DEBUG_INFO as
a pseudo instruction and reports the error `Pseudo opcode found in
emitInstruction()`.
The instruction `TargetOpcode::JUMP_TABLE_DEBUG_INFO` is only used to
record jump table debug info, so it can be ignored when Mips emits
instructions.

Fix #134916.
2025-06-06 15:44:21 +08:00
Jim Lin
f8df24015a
[RISCV] Don't commute with shift if XAndesPerf is enabled (#142920)
More nds.lea.{h,w,d} are generated, similar to sh{1,2,3}add
2025-06-06 11:08:23 +08:00
Jim Lin
d395043300
[RISCV] Select unsigned bitfield insert for XAndesPerf (#142737)
The XAndesPerf extension includes the unsigned bitfield extraction
instruction `NDS.BFOZ`, which can extract bits 0 to Len - 1,
place them starting at bit Msb, and zero-fill the remaining bits.

This patch handles the cases where Msb < Lsb for `NDS.BFOZ`.

Instruction Syntax:

    nds.bfoz Rd, Rs1, Msb, Lsb

The operation is:

    if Msb < Lsb:
        Lenm1 = Lsb - Msb;
        Rd[Lsb:Msb] = Rs1[Lenm1:0];
        if (Lsb < (XLen -1)) Rd[XLen-1:Lsb+1]=0;
        Rd[Msb-1:0]=0;

When Len == 1, it is a special case where the Msb is set to 0 instead of
being equal to the Lsb.
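
The Msb < Lsb operation listed above can be modelled directly. The sketch below follows the pseudocode in this message only; it is not taken from the ISA manual, the function name is hypothetical, and the Len == 1 special case is deliberately not modelled.

```python
def bfoz_msb_lt_lsb(rs1, msb, lsb, xlen=32):
    """Model of nds.bfoz Rd, Rs1, Msb, Lsb for the Msb < Lsb case.

    Rd[Lsb:Msb] = Rs1[Lenm1:0] with Lenm1 = Lsb - Msb; every other bit
    of Rd is zero.
    """
    if not (0 <= msb < lsb < xlen):
        raise ValueError("this sketch only covers 0 <= Msb < Lsb < XLen")
    lenm1 = lsb - msb
    # Take the low Lenm1 + 1 bits of Rs1 ...
    field = rs1 & ((1 << (lenm1 + 1)) - 1)
    # ... and place them so the lowest lands at bit Msb; bits above Lsb
    # and below Msb stay zero.
    return (field << msb) & ((1 << xlen) - 1)


# The low 4 bits of Rs1 are placed at bits 7..4 of Rd.
print(hex(bfoz_msb_lt_lsb(0xFFFFFFFF, msb=4, lsb=7)))
```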
2025-06-06 09:02:18 +08:00
Craig Topper
a23bd179cc
[RISCV] Remove artificial restriction on ShAmt from (shl (and X, C2), C) -> (srli (slli X, C4), C4-C) isel. (#143010)
This code unnecessarily inherited a `ShAmt <= 32` check from an earlier
pattern.
2025-06-05 17:48:35 -07:00
Farzon Lotfi
76c4ba6a1d
Revert "[DirectX] Array GEPs need two indices (#142853)" and "Adjust bit cast instruction filter for DXIL Prepare pass (#142678)" (#143043)
- This reverts commit 9ab4c16042a38d5b80084afff52699e246ca9ea8.
- This reverts commit 1d6e8ec17d547a5f8a0db700dc107a2cd7a321e1.

Noticed a really weird behavior where release and debug builds have
different codegen for loads with geps after this PR. This is going to
take a minute to debug and figure out why so revert seems to make the
most sense.

```diff
diff --git a/llvm/test/CodeGen/DirectX/flatten-array.ll b/llvm/test/CodeGen/DirectX/flatten-array.ll
index 47d7b50cf018..efa9efeff13a 100644
--- a/llvm/test/CodeGen/DirectX/flatten-array.ll
+++ b/llvm/test/CodeGen/DirectX/flatten-array.ll
@@ -123,7 +123,8 @@ define void @gep_4d_test ()  {
@b = internal global [2 x [3 x [4 x i32]]] zeroinitializer, align 16
define void @global_gep_load() {
-  ; CHECK: load i32, ptr getelementptr inbounds ([24 x i32], ptr @a.1dim, i32 0, i32 6), align 4
+  ; CHECK: %1 = getelementptr inbounds [24 x i32], ptr @a.1dim, i32 0, i32 6
+  ; CHECK-NEXT: %2 = load i32, ptr %1, align 4
   ; CHECK-NEXT:    ret void
   %1 = getelementptr inbounds [2 x [3 x [4 x i32]]], [2 x [3 x [4 x i32]]]* @a, i32 0, i32 0
   %2 = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]]* %1, i32 0, i32 1
@@ -176,7 +177,8 @@ define void @global_incomplete_gep_chain(i32 %row, i32 %col) {
}
define void @global_gep_store() {
-  ; CHECK: store i32 1, ptr getelementptr inbounds ([24 x i32], ptr @b.1dim, i32 0, i32 13), align 4
+  ; CHECK: %1 = getelementptr inbounds [24 x i32], ptr @b.1dim, i32 0, i32 13
+  ; CHECK-NEXT: store i32 1, ptr %1, align 4
   ; CHECK-NEXT:    ret void
```
2025-06-05 20:41:37 -04:00
Joshua Batista
1d6e8ec17d
Adjust bit cast instruction filter for DXIL Prepare pass (#142678)
This PR addresses a specific edge case when deciding whether or not to
produce a bitcast instruction.
Specifically, when the given instruction is a global array, the element
type of the array wasn't correctly compared to the return type. In this
specific case, if the types are equal, a bitcast shouldn't be created,
but it was.
This PR checks to see if the element type of the array is the same as
the return type, and if it is, it doesn't create a bitcast instruction.

Fixes https://github.com/llvm/llvm-project/issues/139013
2025-06-05 14:41:14 -07:00
Simon Pilgrim
e953623f50
[X86] combineX86ShuffleChainWithExtract - ensure subvector widening is at index 0 (#143009)
When peeking through insert_subvector(undef,sub,c) widening patterns we
didn't ensure c == 0

Fixes #142995
2025-06-05 21:26:42 +01:00
Stanislav Mekhanoshin
19e2fd5e75
[AMDGPU] Patterns for <2 x bfloat> fneg (fabs) (#142911)
2025-06-05 10:00:24 -07:00
Stanley Gambarin
33974b41c7
[GlobalISel] support lowering of G_SHUFFLEVECTOR with pointer args (#141959)
2025-06-05 09:13:51 -07:00
Brox Chen
d8b245741d
[AMDGPUI][True16][CodeGen] global atomic load i8 in true16 mode (#142822)
Update codegen pattern for global atomic load i8 with d16 instructions
2025-06-05 11:35:18 -04:00
hev
2718a47f49
[LoongArch] Lower vector select mask generation to [X]VMSK{LT,GE,NE}Z if possible (#142109)
This patch adds a DAG combine rule for BITCAST nodes converting from
vector `i1` masks generated by `setcc` into integer vector types. It
recognizes common select mask patterns and lowers them into efficient
LoongArch LSX/LASX mask instructions such as:

- [X]VMSKLTZ.{B,H,W,D}
- [X]VMSKGEZ.B
- [X]VMSKNEZ.B

When the vector comparison matches specific patterns (e.g., x < 0, x >=
0, x != 0, etc.), the transformation is performed pre-legalization. This
avoids scalarization and unnecessary operations, improving both
performance and code size.
2025-06-05 22:17:38 +08:00
Nick Sarnie
3b9ebe9201
[clang] Simplify device kernel attributes (#137882)
We have multiple different attributes in clang representing device
kernels for specific targets/languages. Refactor them into one attribute
with different spellings to make it more easily scalable for new
languages/targets.

---------

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
2025-06-05 14:15:38 +00:00
Harrison Hao
b2379bd5d5
[AMDGPU] Support bottom-up postRA scheduing. (#135295)
Solely relying on top‑down scheduling can underutilize hardware, since
long‑latency instructions often end up scheduled too late and their
latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us
move those instructions earlier, which improves latency hiding and
yields roughly a 2% performance gain on key benchmarks.
2025-06-05 22:07:06 +08:00
zhijian lin
a91b0d2780
[PowerPC] hoist xxspltiw instruction out of the loop with FMA mutation pass. (#111696)
Summary: 
   
The patch fixes the issue [[PowerPC] missing VSX FMA Mutation optimize
in some case for option -schedule-ppc-vsx-fma-mutation-early
#111906](https://github.com/llvm/llvm-project/issues/111906)
   
In certain cases, the Register Coalescer pass—which eliminates COPY
instructions—can interfere with the PowerPC VSX FMA Mutation pass.
Specifically, it can prevent the mutation of a COPY adjacent to an
XSMADDADP into a single XSMADDMDP instruction. As a result, the xxspltiw
instruction is not hoisted out of the loop as expected, leading to
missed optimization opportunities.

To address this, the patch ensures that the `VSX FMA Mutation` pass runs
before the `Register Coalescer` pass when the
-schedule-ppc-vsx-fma-mutation-early option is enabled.
2025-06-05 09:41:51 -04:00
Phoebe Wang
754f2caa5c
[X86][FP16] Widen UI2FP for FP16 when VLX not enabled (#142956)
Fixes: https://godbolt.org/z/5vc8oMhxz
2025-06-05 21:14:04 +08:00
Ryotaro Kasuga
ef60ee6005
[MachinePipeliner] Introduce a new class for loop-carried deps (#137663)
In MachinePipeliner, loop-carried memory dependencies are represented by
a DAG, which makes things complicated and causes some necessary
dependencies to be missing. This patch introduces a new class to manage
loop-carried memory dependencies to simplify the logic. The ultimate
goal is to add currently missing dependencies, but this is a first step
toward that, and this patch doesn't intend to change current behavior. This
patch also adds new tests that show the missed dependencies, which
should be fixed in the future.

Split off from #135148
2025-06-05 21:30:27 +09:00
hev
d979423fb0
[LoongArch][NFC] Pre-commit for lowering vector mask generation to [X]VMSK{LT,GE,NE}Z (#142108)
2025-06-05 20:26:09 +08:00
Phoebe Wang
60808a45dc
[X86][FP16] Add tests for inttofp without VLX, NFC (#142954)
2025-06-05 20:14:06 +08:00
Simon Pilgrim
d88067c341
[X86] combineTargetShuffle - canonicalize vperm2x128(x,x)/vperm2x128(undef,x) -> vperm2x128(x,undef)
Improves fold matching for future patches.
2025-06-05 12:50:33 +01:00
Mahesh-Attarde
3737e7e273
[X86][GlobalIsel] add test for fabs isel (#142558)
G_FABS Test update for https://github.com/llvm/llvm-project/pull/136718

---------

Co-authored-by: mattarde <mattarde@intel.com>
2025-06-05 10:34:33 +01:00
Stanislav Mekhanoshin
9b992f29e0
[AMDGPU] Baseline fneg-fabs.bf16.ll tests. NFC. (#142910)
2025-06-05 02:23:23 -07:00
Stanislav Mekhanoshin
0c1c60fa63
[AMDGPU] Make <2 x bfloat> fabs legal (#142908)
2025-06-05 02:22:09 -07:00
Stanislav Mekhanoshin
100a1d0c4c
[AMDGPU] Baseline fabs.bf16.ll tests. NFC. (#142907)
2025-06-05 00:57:10 -07:00
Phoebe Wang
0c89cbb484
[X86][FP16] Widen 128/256-bit CVTTP2xI to 512-bit when VLX not enabled (#142763)
2025-06-05 15:32:25 +08:00
Ruiling, Song
0487db1f13
MachineScheduler: Improve instruction clustering (#137784)
Clustered nodes were previously managed by adding weak edges between
neighbouring cluster nodes, which forms a sort of ordered queue. This is
later recorded as `NextClusterPred` or `NextClusterSucc` in
`ScheduleDAGMI`.

But the instructions may not be picked in the exact order of the queue.
For example, suppose we have a queue of cluster nodes A B C. During
scheduling, node B might be picked first, in which case it is very likely
that we only cluster B and C for Top-Down scheduling (leaving A alone).

Another issue is:
```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.

For example, we want to cluster nodes (order as in `MemOpRecords`):
1 3 2. 1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2),
since 3(SUa) > 2(SUb), we reorder the two nodes, which makes 2 a
pred of 3. This makes both 1 and 2 become preds of 3, but there is no
edge between 1 and 2. Thus we get a broken cluster chain.
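
The broken chain in this example can be reproduced with a small model of the edge-adding step. This is a sketch under assumptions rather than the scheduler code: it assumes cluster edges are added between adjacent entries of the record order, and it stores predecessors in a plain dictionary. The function name is hypothetical.

```python
def add_cluster_edges(order, reorder_while_clustering=False):
    """Model of the swap-then-addEdge logic quoted above.

    For each adjacent pair (SUa, SUb), swap them if SUa has the larger
    node number, then record SUa as a cluster predecessor of SUb.
    """
    preds = {node: set() for node in order}
    for sua, sub in zip(order, order[1:]):
        if not reorder_while_clustering and sua > sub:
            sua, sub = sub, sua
        preds[sub].add(sua)
    return preds


# Order 1 3 2: both 1 and 2 become preds of 3, with no edge between 1 and 2.
print(add_cluster_edges([1, 3, 2]))
```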

To fix both issues, this change introduces an unordered set. This
could help improve clustering in some hard cases.

One key reason the change causes so many test check changes is that the
cluster candidates are no longer ordered, so the candidates might be
picked in a different order from before.

The most affected targets are: AMDGPU, AArch64, RISCV.

For RISCV, most changes seem to be minor instruction reordering, with no
obvious regressions.

For AArch64, some combining of ldr into ldp is affected, with two cases
regressed and two improved. The deeper reason is that the machine
scheduler cannot cluster them well either before or after the change, and
the later load combine algorithm is also not smart enough.

For AMDGPU, some cases use more v_dual instructions while some are
regressed. This seems less critical. The test `v_vselect_v32bf16`
appears to get more buffer_load instructions claused.
2025-06-05 15:28:04 +08:00
Simon Pilgrim
dba4188167
[X86] combineAdd - fold (add (sub (shl x, c), y), z) -> (sub (add (shl x, c), z), y) (#142734)
Attempt to keep adds/shifts closer together for LEA matching

Fixes #55714
2025-06-05 08:20:44 +01:00
Stanislav Mekhanoshin
a56442529c
[AMDGPU] Make <2 x bfloat> fneg legal (#142870)
2025-06-04 22:09:25 -07:00
Shilei Tian
8cd5604f59
[AMDGPU][AtomicExpand] Use full flat emulation if a target supports f64 global atomic add instruction (#142859)
If a target supports f64 global atomic add instruction, we can also use
full flat emulation.
2025-06-05 00:45:42 -04:00
Stanislav Mekhanoshin
9d41159023
[AMDGPU] Add baseline fneg.bf16.ll tests. NFC. (#142866)
This is a copy of the fneg.f16.ll, just with type replaced.
The final logic shall be the same as with f16 as these are
just bit operations.

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-06-04 20:54:03 -07:00
Harrison Hao
8ca220f1dd
[NFC][AMDGPU] Add lit tests for FMA combining with freeze and nnan variants (#142628)
Combining a `freeze` of an `fmul` (without `nnan`) followed by `fadd` or
`fsub` into a single `fma` is supported.
This patch adds lit tests to verify the optimization behavior for both
nnan and non-nnan variants.
2025-06-05 09:55:44 +08:00
Acthink Yang
7263cd48e6
[LegalizeTypes][MSP430] Soften FAKE_USE operand (#142714)
Adds support for softening FAKE_USE operands.
Adds MSP430 tests that exercise the new softening code.

Fixes #137572
2025-06-05 10:53:57 +09:00