llvm-project

Author	SHA1	Message	Date
Changpeng Fang	70e78be7dc	AMDGPU: Custom lower fptrunc vectors for f32 -> f16 (#141883 ) The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in https://github.com/llvm/llvm-project/pull/139956. This work is an extension to handle wider vectors. Instead of full scalarization, we split the vector into packs (v2f32 -> v2f16) to ensure the packed conversion can always be generated.	2025-06-06 15:15:24 -07:00
LU-JOHN	549ce80f27	[AMDGPU][NFC] Add test for 64-bit lshr with shifts >=32 (#138281 ) Record current results for 64-bit lshr with shifts >=32. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-06-06 17:04:54 -04:00
LU-JOHN	3bbb49610e	[AMDGPU][NFC] Add tests for 64-bit ashr with shifts >= 32 (#142463 ) Record current results for 64-bit ashr with shifts >=32. Signed-off-by: John Lu <John.Lu@amd.com>	2025-06-06 17:04:45 -04:00
David Green	b0f53d95c1	[AArch64] Add SUBS(CSEL) fold from brcond. (#142103 ) This folds away subs(csel(1, 0, cc)) from brcond, that can be produced in certain places from compares that are not already subs (like adc/sbc generated from i128 add_with_overflow intrinsics).	2025-06-06 18:00:00 +01:00
David Green	645c0d509c	[AArch64][GlobalISel] Ensure we have an insert-subreg v4i32 GPR pattern (#142724 ) This is the GISel equivalent of scalar_to_vector, making sure that when we insert into undef we use an fmov that avoids the artificial dependency on the previous register. This adds v2i32 and v2i64 patterns too for similar reasons.	2025-06-06 17:44:33 +01:00
Guy David	2c0a2261b1	[AArch64] Spare N2I roundtrip when splatting float comparison (#141806 ) Transform `select_cc t1, t2, -1, 0` for floats into a vector comparison which generates a mask, which is later on combined with potential vectorized DUPs.	2025-06-06 19:07:12 +03:00
David Green	56ebe64ce6	[AArch64] Enable aggressivelyPreferBuildVectorSources (#142729 ) This helps to remove some inefficient buildvector lowering by converting extract_vector_elt(buildvector) to the original source.	2025-06-06 17:03:10 +01:00
Simon Pilgrim	399865cbf0	[X86] combineConcatVectorOps - concat per-lane v2f64/v4f64 shuffles into vXf64 vshufpd (#143017 ) We can always concatenate v2f64/v4f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands (or it's a unary shuffle). I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation/length changing as well as combineConcatVectorOps.	2025-06-06 16:41:40 +01:00
Luke Lau	a029ece7b0	[RISCV] Fix coalescing vsetvlis when AVL and vl registers are the same (#141941 ) With EVL tail folding we can end up with vsetvlis where the output vl and the input AVL are the same register. When we try to coalesce it we crashed because we tried to move the def's live interval before the kill's live interval, e.g. in this example: (vn0 def) dead $x0 = PseudoVSETIVLI 1, 192, implicit-def $vl, implicit-def $vtype renamable $v9 = COPY killed renamable $v8 (vn1 def) %23:gprnox0 = PseudoVSETVLI killed (vn0) %23:gprnox0, 197, implicit-def $vl, implicit-def $vtype We would try to move the vn1 def VNInfo up to the previous VSETVLI, in the middle of vn0's segment. However separately, we were also assuming that the vl would only have one definition and thus were just taking the VNInfo from beginIndex(), so we ended up with a backwards segment and got the error "Cannot create empty or backwards segment". This fixes these two issues, the first one by moving the AVL operand + live interval up first, and the second by taking the VNInfo from NextMI's slot index. Fixes #141907	2025-06-06 17:34:27 +02:00
Durgadoss R	c4012bb5de	[NVPTX] Add pm_event intrinsics (#141278 ) This patch adds the pm_event.mask intrinsic and its clang-builtin. Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-06-06 19:39:33 +05:30
Feng Zou	efc70787b5	[X86][APX] Prevent from emitting push2/pop2 if stack alignment<16B (#143076 ) push2/pop2 requires 16-byte stack alignment. If the stack alignment is less than that, push2/pop2 should not be emitted. A general protection exception is triggered if the data being pushed/popped by push2/pop2 is not 16-byte aligned on the stack.	2025-06-06 21:10:21 +08:00
Kerry McLaughlin	df48dfa0ae	[AArch64] Add custom lowering of nxv32i1 get.active.lane.mask nodes (#141969 ) performActiveLaneMaskCombine already tries to combine a single get.active.lane.mask where the low and high halves of the result are extracted into a single whilelo which operates on a predicate pair. If the get.active.lane.mask node requires splitting, multiple nodes are created with saturating adds to increment the starting index. We cannot combine these into a single whilelo_x2 at this point unless we know the add will not overflow. This patch adds custom lowering for the node if the return type is nxv32i1, as this can be replaced with a whilelo_x2 using legal types. Anything wider than nxv32i1 will still require splitting first.	2025-06-06 13:35:34 +01:00
Simon Pilgrim	4e676a1317	[X86] Fold (add X, (srl Y, 7)) -> (sub X, (ashr Y, 7)) on vXi8 vectors (#143106 ) Undo the vectorcombine canonicalisation as SSE has awful vXi8 shift support, but can easily splat the MSB using the PCMPGTB(0,x) trick. Fixes #130549	2025-06-06 13:33:00 +01:00
Benjamin Maxwell	c95bc41562	[AArch64][SDAG] Fix selection of extend of v1if16 SETCC (#140274 ) There is a DAG combine, that folds: ``` t1: v1i1 = setcc x:v1f16, y:v1f16, setogt:ch t2: v1i64 = zero_extend t1 ``` -> ``` t1: v1i16 = setcc x:v1f16, y:v1f16, setogt:ch t2: v1i64 = any_extend t1 ``` This creates an issue on AArch64 when attempting to widen the result to `v4i16`. The operand types (`v1f16`) are set to be scalarized, so the "by hand" widening with `DAG.WidenVector` is used for them, however, this only widens to the next power-of-2, so returns `v2f16`, which does not match the result VF. The fix is to manually construct the widened inputs using `INSERT_SUBVECTOR`. Fixes #136540	2025-06-06 11:20:52 +01:00
Simon Pilgrim	f59742c1ea	[X86] getIntImmCostInst - recognise i64 ICMP EQ/NE special cases (#142812 ) If the lower 32 bits of an i64 value are known to be zero, then icmp lowering will shift+truncate down to an i32, allowing the immediate to be embedded. There's a lot more that could be done here to match icmp lowering, but this PR just focuses on known regressions. Fixes #142513 Fixes #62145	2025-06-06 10:21:40 +01:00
Jay Foad	5f33b9d286	[MIRParser] Report register class errors in a deterministic order (#142928 )	2025-06-06 10:03:34 +01:00
Guy David	4d4b7cc69e	[AArch64] Skip storing of stack arguments when lowering tail calls (#126735 ) This issue starts in the selection DAG and causes the backend to emit the following for a trivial tail call: ``` ldr w8, [sp] str w8, [sp] b func ``` I'm not too sure that checking for immutability of a specific stack object is a good enough guarantee, because as soon as a tail call is done lowering, `setHasTailCall()` is called, and in that case perhaps a pass is allowed to change the value of the object in memory? This can be extended to the ARM backend as well. Removed the `tailcall` keyword from a few other test assets; I'm assuming their original intent was left intact.	2025-06-06 11:26:24 +03:00
hev	182c1c268f	[LoongArch][NFC] Pre-commit for converting vector mask to `vXi1` using `[X]VMSKLTZ` (#142977 )	2025-06-06 16:26:17 +08:00
hev	470f456567	[LoongArch] Add codegen support for atomic-ops on LA32 (#141557 ) This patch adds codegen support for atomic operations `cmpxchg`, `max`, `min`, `umax` and `umin` on the LA32 target.	2025-06-06 16:00:59 +08:00
yingopq	bbe5ceb22f	[Mips] When emit instruction, ignore JUMP_TABLE_DEBUG_INFO (#139830 ) When -triple is windows, SelectionDAGLegalize generates ISD::JUMP_TABLE_DEBUG_INFO while legalizing br_jt. Mips emitInstruction then treats JUMP_TABLE_DEBUG_INFO as a pseudo instruction and reports the error `Pseudo opcode found in emitInstruction()`. The instruction `TargetOpcode::JUMP_TABLE_DEBUG_INFO` is only used to note jump table debug info, so we can ignore it when Mips emits instructions. Fixes #134916.	2025-06-06 15:44:21 +08:00
Jim Lin	f8df24015a	[RISCV] Don't commute with shift if XAndesPerf is enabled (#142920 ) More nds.lea.{h,w,d} are generated, similar to sh{1,2,3}add	2025-06-06 11:08:23 +08:00
Jim Lin	d395043300	[RISCV] Select unsigned bitfield insert for XAndesPerf (#142737 ) The XAndesPerf extension includes an unsigned bitfield extraction instruction, `NDS.BFOZ`, which can extract the bits from 0 to Len-1, place them starting at bit Msb, and zero-fill the remaining bits. This patch handles the cases where Msb < Lsb for `NDS.BFOZ`. Instruction Syntax: nds.bfoz Rd, Rs1, Msb, Lsb The operation is: if Msb < Lsb: Lenm1 = Lsb - Msb; Rd[Lsb:Msb] = Rs1[Lenm1:0]; if (Lsb < (XLen -1)) Rd[XLen-1:Lsb+1]=0; Rd[Msb-1:0]=0; When Len == 1, it is a special case where the Msb is set to 0 instead of being equal to the Lsb.	2025-06-06 09:02:18 +08:00
Craig Topper	a23bd179cc	[RISCV] Remove artificial restriction on ShAmt from (shl (and X, C2), C) -> (srli (slli X, C4), C4-C) isel. (#143010 ) This code unnecessarily inherited a `ShAmt <= 32` check from an earlier pattern.	2025-06-05 17:48:35 -07:00
Farzon Lotfi	76c4ba6a1d	Revert "[DirectX] Array GEPs need two indices (#142853 )" and "Adjust bit cast instruction filter for DXIL Prepare pass (#142678 )" (#143043 ) - This reverts commit 9ab4c16042a38d5b80084afff52699e246ca9ea8. - This reverts commit 1d6e8ec17d547a5f8a0db700dc107a2cd7a321e1. Noticed a really weird behavior where release and debug builds have different codegen for loads with geps after this PR. This is going to take a minute to debug and figure out why so revert seems to make the most sense. ```diff diff --git a/llvm/test/CodeGen/DirectX/flatten-array.ll b/llvm/test/CodeGen/DirectX/flatten-array.ll index 47d7b50cf018..efa9efeff13a 100644 --- a/llvm/test/CodeGen/DirectX/flatten-array.ll +++ b/llvm/test/CodeGen/DirectX/flatten-array.ll @@ -123,7 +123,8 @@ define void @gep_4d_test () { @b = internal global [2 x [3 x [4 x i32]]] zeroinitializer, align 16 define void @global_gep_load() { - ; CHECK: load i32, ptr getelementptr inbounds ([24 x i32], ptr @a.1dim, i32 0, i32 6), align 4 + ; CHECK: %1 = getelementptr inbounds [24 x i32], ptr @a.1dim, i32 0, i32 6 + ; CHECK-NEXT: %2 = load i32, ptr %1, align 4 ; CHECK-NEXT: ret void %1 = getelementptr inbounds [2 x [3 x [4 x i32]]], [2 x [3 x [4 x i32]]]* @a, i32 0, i32 0 %2 = getelementptr inbounds [3 x [4 x i32]], [3 x [4 x i32]]* %1, i32 0, i32 1 @@ -176,7 +177,8 @@ define void @global_incomplete_gep_chain(i32 %row, i32 %col) { } define void @global_gep_store() { - ; CHECK: store i32 1, ptr getelementptr inbounds ([24 x i32], ptr @b.1dim, i32 0, i32 13), align 4 + ; CHECK: %1 = getelementptr inbounds [24 x i32], ptr @b.1dim, i32 0, i32 13 + ; CHECK-NEXT: store i32 1, ptr %1, align 4 ; CHECK-NEXT: ret void ```	2025-06-05 20:41:37 -04:00
Joshua Batista	1d6e8ec17d	Adjust bit cast instruction filter for DXIL Prepare pass (#142678 ) This PR addresses a specific edge case when deciding whether or not to produce a bitcast instruction. Specifically, when the given instruction is a global array, the element type of the array wasn't correctly compared to the return type. In this specific case, if the types are equal, a bitcast shouldn't be created, but it was. This PR checks to see if the element type of the array is the same as the return type, and if it is, it doesn't create a bitcast instruction. Fixes https://github.com/llvm/llvm-project/issues/139013	2025-06-05 14:41:14 -07:00
Simon Pilgrim	e953623f50	[X86] combineX86ShuffleChainWithExtract - ensure subvector widening is at index 0 (#143009 ) When peeking through insert_subvector(undef,sub,c) widening patterns we didn't ensure c == 0 Fixes #142995	2025-06-05 21:26:42 +01:00
Stanislav Mekhanoshin	19e2fd5e75	[AMDGPU] Patterns for <2 x bfloat> fneg (fabs) (#142911 )	2025-06-05 10:00:24 -07:00
Stanley Gambarin	33974b41c7	[GlobalISel] support lowering of G_SHUFFLEVECTOR with pointer args (#141959 )	2025-06-05 09:13:51 -07:00
Brox Chen	d8b245741d	[AMDGPUI][True16][CodeGen] global atomic load i8 in true16 mode (#142822 ) Update codegen pattern for global atomic load i8 with d16 instructions	2025-06-05 11:35:18 -04:00
hev	2718a47f49	[LoongArch] Lower vector select mask generation to `[X]VMSK{LT,GE,NE}Z` if possible (#142109 ) This patch adds a DAG combine rule for BITCAST nodes converting from vector `i1` masks generated by `setcc` into integer vector types. It recognizes common select mask patterns and lowers them into efficient LoongArch LSX/LASX mask instructions such as: - [X]VMSKLTZ.{B,H,W,D} - [X]VMSKGEZ.B - [X]VMSKNEZ.B When the vector comparison matches specific patterns (e.g., x < 0, x >= 0, x != 0, etc.), the transformation is performed pre-legalization. This avoids scalarization and unnecessary operations, improving both performance and code size.	2025-06-05 22:17:38 +08:00
Nick Sarnie	3b9ebe9201	[clang] Simplify device kernel attributes (#137882 ) We have multiple different attributes in clang representing device kernels for specific targets/languages. Refactor them into one attribute with different spellings to make it more easily scalable for new languages/targets. --------- Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>	2025-06-05 14:15:38 +00:00
Harrison Hao	b2379bd5d5	[AMDGPU] Support bottom-up postRA scheduling. (#135295 ) Solely relying on top-down scheduling can underutilize hardware, since long-latency instructions often end up scheduled too late and their latency isn't well hidden. Adding bottom-up post-RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.	2025-06-05 22:07:06 +08:00
zhijian lin	a91b0d2780	[PowerPC] hoist xxspltiw instruction out of the loop with FMA mutation pass. (#111696 ) Summary: The patch fixes the issue [[PowerPC] missing VSX FMA Mutation optimize in some case for option -schedule-ppc-vsx-fma-mutation-early #111906](https://github.com/llvm/llvm-project/issues/111906) In certain cases, the Register Coalescer pass—which eliminates COPY instructions—can interfere with the PowerPC VSX FMA Mutation pass. Specifically, it can prevent the mutation of a COPY adjacent to an XSMADDADP into a single XSMADDMDP instruction. As a result, the xxspltiw instruction is not hoisted out of the loop as expected, leading to missed optimization opportunities. To address this, the patch ensures that the `VSX FMA Mutation` pass runs before the `Register Coalescer` pass when the -schedule-ppc-vsx-fma-mutation-early option is enabled.	2025-06-05 09:41:51 -04:00
Phoebe Wang	754f2caa5c	[X86][FP16] Widen UI2FP for FP16 when VLX not enabled (#142956 ) Fixes: https://godbolt.org/z/5vc8oMhxz	2025-06-05 21:14:04 +08:00
Ryotaro Kasuga	ef60ee6005	[MachinePipeliner] Introduce a new class for loop-carried deps (#137663 ) In MachinePipeliner, loop-carried memory dependencies are represented by DAG, which makes things complicated and causes some necessary dependencies to be missing. This patch introduces a new class to manage loop-carried memory dependencies to simplify the logic. The ultimate goal is to add currently missing dependencies, but this is a first step of that, and this patch doesn't intend to change current behavior. This patch also adds new tests that show the missed dependencies, which should be fixed in the future. Split off from #135148	2025-06-05 21:30:27 +09:00
hev	d979423fb0	[LoongArch][NFC] Pre-commit for lowering vector mask generation to `[X]VMSK{LT,GE,NE}Z` (#142108 )	2025-06-05 20:26:09 +08:00
Phoebe Wang	60808a45dc	[X86][FP16] Add tests for inttofp without VLX, NFC (#142954 )	2025-06-05 20:14:06 +08:00
Simon Pilgrim	d88067c341	[X86] combineTargetShuffle - canonicalize vperm2x128(x,x)/vperm2x128(undef,x) -> vperm2x128(x,undef) Improves fold matching for future patches.	2025-06-05 12:50:33 +01:00
Mahesh-Attarde	3737e7e273	[X86][GlobalIsel] add test for fabs isel (#142558 ) G_FABS Test update for https://github.com/llvm/llvm-project/pull/136718 --------- Co-authored-by: mattarde <mattarde@intel.com>	2025-06-05 10:34:33 +01:00
Stanislav Mekhanoshin	9b992f29e0	[AMDGPU] Baseline fneg-fabs.bf16.ll tests. NFC. (#142910 )	2025-06-05 02:23:23 -07:00
Stanislav Mekhanoshin	0c1c60fa63	[AMDGPU] Make <2 x bfloat> fabs legal (#142908 )	2025-06-05 02:22:09 -07:00
Stanislav Mekhanoshin	100a1d0c4c	[AMDGPU] Baseline fabs.bf16.ll tests. NFC. (#142907 )	2025-06-05 00:57:10 -07:00
Phoebe Wang	0c89cbb484	[X86][FP16] Widen 128/256-bit CVTTP2xI to 512-bit when VLX not enabled (#142763 )	2025-06-05 15:32:25 +08:00
Ruiling, Song	0487db1f13	MachineScheduler: Improve instruction clustering (#137784 ) The existing way of managing clustered nodes was done through adding weak edges between the neighbouring cluster nodes, which is a sort of ordered queue. This is later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`. But the instructions may not be picked in the exact order of the queue. For example, we have a queue of cluster nodes A B C. But during scheduling, node B might be picked first, in which case it is very likely that we only cluster B and C for Top-Down scheduling (leaving A alone). Another issue is: ``` if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum) std::swap(SUa, SUb); if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) ``` may break the cluster queue. For example, we want to cluster nodes (order as in `MemOpRecords`): 1 3 2. 1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2), as 3(SUa) > 2(SUb), we would reorder the two nodes, which makes 2 be pred of 3. This makes both 1 and 2 become preds of 3, but there is no edge between 1 and 2. Thus we get a broken cluster chain. To fix both issues, we introduce an unordered set in the change. This could help improve clustering in some hard cases. One key reason the change causes so many test check changes is that, as the cluster candidates are no longer ordered, the candidates might be picked in a different order from before. The most affected targets are AMDGPU, AArch64, and RISCV. For RISCV, most changes seem to be minor instruction reordering, with no obvious regression. For AArch64, some combining of ldr into ldp was affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster them well either before or after the change, and the later load-combine algorithm is also not smart enough. For AMDGPU, some cases have more v_dual instructions used while some are regressed. It seems less critical. Seems like test `v_vselect_v32bf16` gets more buffer_load being claused.	2025-06-05 15:28:04 +08:00
Simon Pilgrim	dba4188167	[X86] combineAdd - fold (add (sub (shl x, c), y), z) -> (sub (add (shl x, c), z), y) (#142734 ) Attempt to keep adds/shifts closer together for LEA matching Fixes #55714	2025-06-05 08:20:44 +01:00
Stanislav Mekhanoshin	a56442529c	[AMDGPU] Make <2 x bfloat> fneg legal (#142870 )	2025-06-04 22:09:25 -07:00
Shilei Tian	8cd5604f59	[AMDGPU][AtomicExpand] Use full flat emulation if a target supports f64 global atomic add instruction (#142859 ) If a target supports f64 global atomic add instruction, we can also use full flat emulation.	2025-06-05 00:45:42 -04:00
Stanislav Mekhanoshin	9d41159023	[AMDGPU] Add baseline fneg.bf16.ll tests. NFC. (#142866 ) This is a copy of the fneg.f16.ll, just with type replaced. The final logic shall be the same as with f16 as these are just bit operations. --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-06-04 20:54:03 -07:00
Harrison Hao	8ca220f1dd	[NFC][AMDGPU] Add lit tests for FMA combining with freeze and nnan variants (#142628 ) `freeze` on `fmul` (without `nnan`) followed by `fadd` or `fsub` into a single `fma` is supported. This patch adds lit tests to verify the optimization behavior for both nnan and non-nnan variants.	2025-06-05 09:55:44 +08:00
Acthink Yang	7263cd48e6	[LegalizeTypes][MSP430] Soften FAKE_USE operand (#142714 ) Adds support for softening FAKE_USE operands. Adds MSP430 tests that exercise the new softening code. Fixes #137572	2025-06-05 10:53:57 +09:00
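
Note on commit 4e676a1317 (the vXi8 `add`/`srl` fold listed above): the rewrite relies on a per-lane identity that can be checked exhaustively. The sketch below is not LLVM code; it models a single i8 lane as a Python integer, and the helper names (`srl7`, `ashr7`, `add_srl`, `sub_ashr`, `identity_holds`) are hypothetical, introduced only for illustration. It assumes the usual two's-complement wraparound for 8-bit lanes.

```python
# Exhaustive check of the i8 identity behind the fold
#   (add X, (srl Y, 7))  ->  (sub X, (ashr Y, 7))
# All helper names are hypothetical; one vector lane is modelled as an int.

MASK = 0xFF  # width of one i8 lane


def srl7(y: int) -> int:
    """Logical shift right by 7: leaves just the sign bit, so 0 or 1."""
    return (y & MASK) >> 7


def ashr7(y: int) -> int:
    """Arithmetic shift right by 7: splats the sign bit, so 0x00 or 0xFF."""
    return MASK if (y & 0x80) else 0


def add_srl(x: int, y: int) -> int:
    """Original form: X + (Y >>u 7), wrapped to 8 bits."""
    return (x + srl7(y)) & MASK


def sub_ashr(x: int, y: int) -> int:
    """Rewritten form: X - (Y >>s 7), wrapped to 8 bits."""
    return (x - ashr7(y)) & MASK


def identity_holds() -> bool:
    """Compare both forms over every pair of 8-bit lane values."""
    return all(
        add_srl(x, y) == sub_ashr(x, y)
        for x in range(256)
        for y in range(256)
    )


if __name__ == "__main__":
    print(identity_holds())
```

The identity holds because subtracting 0xFF (that is, -1) modulo 256 is the same as adding 1. The splatted sign bit is the value the commit message says SSE can produce cheaply with the PCMPGTB(0,x) trick.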

59157 Commits