llvm-project

Author	SHA1	Message	Date
Changpeng Fang	70e78be7dc	AMDGPU: Custom lower fptrunc vectors for f32 -> f16 (#141883 ) The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in https://github.com/llvm/llvm-project/pull/139956. This work is an extension to handle wider vectors. Instead of fully scalarizing, we split the vector into packs (v2f32 -> v2f16) to ensure the packed conversion can always be generated.	2025-06-06 15:15:24 -07:00
LU-JOHN	549ce80f27	[AMDGPU][NFC] Add test for 64-bit lshr with shifts >=32 (#138281 ) Record current results for 64-bit lshr with shifts >=32. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-06-06 17:04:54 -04:00
LU-JOHN	3bbb49610e	[AMDGPU][NFC] Add tests for 64-bit ashr with shifts >= 32 (#142463 ) Record current results for 64-bit ashr with shifts >=32. Signed-off-by: John Lu <John.Lu@amd.com>	2025-06-06 17:04:45 -04:00
Stanislav Mekhanoshin	19e2fd5e75	[AMDGPU] Patterns for <2 x bfloat> fneg (fabs) (#142911 )	2025-06-05 10:00:24 -07:00
Stanley Gambarin	33974b41c7	[GlobalISel] support lowering of G_SHUFFLEVECTOR with pointer args (#141959 )	2025-06-05 09:13:51 -07:00
Brox Chen	d8b245741d	[AMDGPUI][True16][CodeGen] global atomic load i8 in true16 mode (#142822 ) Update codegen pattern for global atomic load i8 with d16 instructions	2025-06-05 11:35:18 -04:00
Nick Sarnie	3b9ebe9201	[clang] Simplify device kernel attributes (#137882 ) We have multiple different attributes in clang representing device kernels for specific targets/languages. Refactor them into one attribute with different spellings to make it more easily scalable for new languages/targets. --------- Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>	2025-06-05 14:15:38 +00:00
Harrison Hao	b2379bd5d5	[AMDGPU] Support bottom-up postRA scheduling. (#135295 ) Solely relying on top‑down scheduling can underutilize hardware, since long‑latency instructions often end up scheduled too late and their latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.	2025-06-05 22:07:06 +08:00
Stanislav Mekhanoshin	9b992f29e0	[AMDGPU] Baseline fneg-fabs.bf16.ll tests. NFC. (#142910 )	2025-06-05 02:23:23 -07:00
Stanislav Mekhanoshin	0c1c60fa63	[AMDGPU] Make <2 x bfloat> fabs legal (#142908 )	2025-06-05 02:22:09 -07:00
Stanislav Mekhanoshin	100a1d0c4c	[AMDGPU] Baseline fabs.bf16.ll tests. NFC. (#142907 )	2025-06-05 00:57:10 -07:00
Ruiling, Song	0487db1f13	MachineScheduler: Improve instruction clustering (#137784 ) The existing way of managing clustered nodes was to add weak edges between neighbouring cluster nodes, which forms a sort of ordered queue. This is later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`. But the instructions may not be picked in the exact order of the queue. For example, suppose we have a queue of cluster nodes A B C. During scheduling, node B might be picked first, and then it is very likely that we only cluster B and C for Top-Down scheduling (leaving A alone). Another issue is: ``` if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum) std::swap(SUa, SUb); if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) ``` may break the cluster queue. For example, we want to cluster nodes (ordered as in `MemOpRecords`): 1 3 2. 1(SUa) will be a pred of 3(SUb) normally. But when it comes to (3, 2), as 3(SUa) > 2(SUb), we would reorder the two nodes, which makes 2 a pred of 3. This makes both 1 and 2 preds of 3, but there is no edge between 1 and 2. Thus we get a broken cluster chain. To fix both issues, we introduce an unordered set in the change. This could help improve clustering in some hard cases. One key reason the change causes so many test check changes is that, as the cluster candidates are no longer ordered, the candidates might be picked in a different order from before. The most affected targets are: AMDGPU, AArch64, RISCV. For RISCV, it seems to me most are just minor instruction reorderings; I don't see an obvious regression. For AArch64, some combining of ldr into ldp is affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster them well either before or after the change, and the later load combine algorithm is also not smart enough. For AMDGPU, some cases have more v_dual instructions used while some are regressed. It seems less critical. Seems like test `v_vselect_v32bf16` gets more buffer_load being claused.	2025-06-05 15:28:04 +08:00
Stanislav Mekhanoshin	a56442529c	[AMDGPU] Make <2 x bfloat> fneg legal (#142870 )	2025-06-04 22:09:25 -07:00
Shilei Tian	8cd5604f59	[AMDGPU][AtomicExpand] Use full flat emulation if a target supports f64 global atomic add instruction (#142859 ) If a target supports f64 global atomic add instruction, we can also use full flat emulation.	2025-06-05 00:45:42 -04:00
Stanislav Mekhanoshin	9d41159023	[AMDGPU] Add baseline fneg.bf16.ll tests. NFC. (#142866 ) This is a copy of the fneg.f16.ll, just with type replaced. The final logic shall be the same as with f16 as these are just bit operations. --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-06-04 20:54:03 -07:00
Harrison Hao	8ca220f1dd	[NFC][AMDGPU] Add lit tests for FMA combining with freeze and nnan variants (#142628 ) `freeze` on `fmul` (without `nnan`) followed by `fadd` or `fsub` into a single `fma` is supported. This patch adds lit tests to verify the optimization behavior for both nnan and non-nnan variants.	2025-06-05 09:55:44 +08:00
Brox Chen	d2f06b2729	[AMDGPU][True16][MC][CodeGen] true16 mode for v_cvt_pk_bf8/fp8_f32 (#141881 ) Update true16/fake16 profile with v_cvt_pk_bf8/fp8_f32, keeping the vdst_in profile, and update codegen pattern. update mc test and codegen test.	2025-06-04 11:29:26 -04:00
Brox Chen	b668b6439a	[AMDGPU][True16][CodeGen] legalize 16bit and 32bit use-def chain for moveToVALU in si-fix-sgpr-lowering (#138734 ) Two changes in this patch: 1. Covered another case in the legalizeOperandVALUt16 functions and the COPY lowering: when SALU16 is used by SALU32, a reg_sequence needs to be inserted after the move to VALU (previously only the SALU32 used by SALU16 case was considered). 2. Moved the useMI analysis into addUsersToMoveVALUList. Legalize the targeted operand when needed. Turn on the frem test with true16 mode for gfx1150, which was failing before this patch. A few bitcast tests are also impacted by this change, with some v_mov being replaced by dual mov.	2025-06-04 09:53:10 -04:00
Carl Ritson	e47e4d8ae6	[AMDGPU] SIInsertHardClause: add configurable clause length limit (#142343 ) Add command line and function attribute configuration of hard clause length limit (within hardware maximum). This allows performance tuning for shaders which benefit from smaller clauses.	2025-06-04 15:00:34 +09:00
Vigneshwar Jayakumar	b3a8c1ef3a	[AMDGPU] Bugfix for scaled MFMA parsing FP literals (#142493 ) bugfix on parsing FP literals for scale values in the scaled MFMA. Due to the change in order of operands between MCinst and parsed operands, the FP literal imms for scale values were not parsed correctly. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-06-03 19:27:57 -05:00
YunQiang Su	bd831372b2	expandFMINIMUMNUM_FMAXIMUMNUM: Quiet is not needed for NaN vs NaN (#139237 ) The new LangRef doesn't require quieting for NaN vs NaN, i.e. the result may be sNaN for sNaN vs NaN. See: https://github.com/llvm/llvm-project/pull/139228	2025-06-04 08:20:48 +08:00
Harrison Hao	0107c9333c	[DAG] canCreateUndefOrPoison – mark fneg/fadd/fsub/fmul/fdiv/frem as not poison generating (#142345 ) After revisiting the LLVM Language Reference Manual, it is confirmed that plain floating-point operations (`fneg`, `fadd`, `fsub`, `fmul`, `fdiv`, and `frem`) propagate poison but do not inherently create new poison values. Thus, `SelectionDAG::canCreateUndefOrPoison` should return `false` for these operations by default. Poison generation in FP instructions occurs only when specific fast-math flags (`nnan`, `ninf`, or the collective fast) are present, as these flags explicitly convert NaN or Inf results into poison. References: - [`fneg` instruction documentation](https://llvm.org/docs/LangRef.html#fneg-instruction) - [`fadd` instruction documentation](https://llvm.org/docs/LangRef.html#fadd-instruction) - [`fsub` instruction documentation](https://llvm.org/docs/LangRef.html#fsub-instruction) - [`fmul` instruction documentation](https://llvm.org/docs/LangRef.html#fmul-instruction) - [`fdiv` instruction documentation](https://llvm.org/docs/LangRef.html#fdiv-instruction) - [`frem` instruction documentation](https://llvm.org/docs/LangRef.html#frem-instruction) - [Fast-Math Flags documentation](https://llvm.org/docs/LangRef.html#fast-math-flags)	2025-06-03 19:21:40 +08:00
Diana Picus	130080fab1	[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#133242 ) Don't count register uses when determining the maximum number of registers used by a function. Count only the defs. This is really an underestimate of the true register usage, but in practice that's not a problem because if a function uses a register, then it has either defined it earlier, or some other function that executed before has defined it. In particular, the register counts are used: 1. When launching an entry function - in which case we're safe because the register counts of the entry function will include the register counts of all callees. 2. At function boundaries in dynamic VGPR mode. In this case it's safe because whenever we set the new VGPR allocation we take into account the outgoing_vgpr_count set by the middle-end. The main advantage of doing this is that the artificial VGPR arguments used only for preserving the inactive lanes when using the llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This enables us to allocate only the registers we need in dynamic VGPR mode. --------- Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>	2025-06-03 11:20:48 +02:00
Yingwei Zheng	1984c7539e	[ValueTracking] Do not use FMF from fcmp (#142266 ) This patch introduces an FMF parameter for `matchDecomposedSelectPattern` to pass FMF flags from select, instead of fcmp. Closes https://github.com/llvm/llvm-project/issues/137998. Closes https://github.com/llvm/llvm-project/issues/141017.	2025-06-02 18:21:14 +08:00
Harrison Hao	1a7f5f5833	[AMDGPU] Promote nestedGEP allocas to vectors (#141199 ) Supports the `nestedGEP` pattern that appears when an alloca is first indexed as an array element and then shifted with a byte‑offset GEP: ```llvm %SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8 %row = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j %elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4 %val = load i32, ptr addrspace(5) %elt1 ``` The pass folds the two levels of addressing into a single vector lane index and keeps the whole object in a VGPR: ```llvm %vec = freeze <20 x i32> poison ; alloca promote <20 x i32> %idx0 = mul i32 %j, 2 ; j * 2 %idx = add i32 %idx0, 1 ; j * 2 + 1 %val = extractelement <20 x i32> %vec, i32 %idx ``` This eliminates the scratch read.	2025-06-02 16:20:14 +08:00
Matt Arsenault	ad0a52202e	AMDGPU: Improve v32f16/v32bf16 copysign handling (#142177 )	2025-05-31 08:24:51 +02:00
Matt Arsenault	3aeffcfde1	AMDGPU: Improve v16f16/v16bf16 copysign handling (#142176 )	2025-05-31 08:18:52 +02:00
Matt Arsenault	ffee01e748	AMDGPU: Improve v8f16/v8bf16 copysign handling (#142175 )	2025-05-31 08:15:45 +02:00
Matt Arsenault	20ad4209dd	AMDGPU: Improve v4f16/v4bf16 copysign handling (#142174 )	2025-05-31 08:09:51 +02:00
Matt Arsenault	4aa4005e04	AMDGPU: Make copysign with matching v2f16/v2bf16 inputs legal (#142173 ) Fixes #141931	2025-05-31 08:06:49 +02:00
Shilei Tian	4d48673562	Reapply "Reapply "[AMDGPU] Make `getAssumedAddrSpace` return AS1 for pointer kernel arguments (#137488 )"" This reverts commit 37ea3b32cdcb6c0dcecbcc4bf844f5190c7378dd.	2025-05-30 22:11:22 -04:00
Shilei Tian	37ea3b32cd	Revert "Reapply "[AMDGPU] Make `getAssumedAddrSpace` return AS1 for pointer kernel arguments (#137488 )"" This reverts commit 4efc13f8ff1eaf4f9fb1fcea8d4552b3eca052ca.	2025-05-30 22:06:16 -04:00
Shilei Tian	4efc13f8ff	Reapply "[AMDGPU] Make `getAssumedAddrSpace` return AS1 for pointer kernel arguments (#137488 )" This reverts commit 3c6211c183885afb5d89259a53c4f4f46a6bf399.	2025-05-30 21:56:24 -04:00
Shilei Tian	3c6211c183	Revert "[AMDGPU] Make `getAssumedAddrSpace` return AS1 for pointer kernel arguments (#137488 )" This reverts commit 9bf6b2a8cb0467b62173659306e43a0346f063a2.	2025-05-30 21:15:25 -04:00
Shilei Tian	9bf6b2a8cb	[AMDGPU] Make `getAssumedAddrSpace` return AS1 for pointer kernel arguments (#137488 )	2025-05-30 17:30:42 -04:00
Matt Arsenault	6a6aec6f4e	AMDGPU: Handle vectors in copysign sign type combine (#142157 ) This avoids some ugly codegen on pre-16-bit instruction targets now from annoying f16 legalization effects. This also avoids regressions on newer targets in a future patch.	2025-05-30 20:02:07 +02:00
Matt Arsenault	e39e99022a	AMDGPU: Handle vectors in copysign magnitude sign case (#142156 )	2025-05-30 19:58:55 +02:00
Matt Arsenault	ba4f4a1a18	AMDGPU: Add more f16 copysign tests (#142115 )	2025-05-30 19:56:15 +02:00
Matt Arsenault	c9cca5cdc4	AMDGPU: Move bf16 copysign tests to separate file (#142114 ) Make symmetric with other copysign tests	2025-05-30 19:52:56 +02:00
Matt Arsenault	d11f9d45e4	AMDGPU: Avoid using kernels in f16 copysign test (#142113 ) Avoid the memory noise in tests that predate function support.	2025-05-30 19:49:45 +02:00
Jay Foad	f8d3bdf6a2	[AMDGPU] Fix SIFixSGPRCopies handling of STRICT_WWM and friends (#142122 ) SIFixSGPRCopies handled STRICT_WWM (and similar WWM/WQM pseudos) like a COPY. In particular, if the source was a VGPR and the result was an SGPR, lowerVGPR2SGPRCopies would replace it with a readfirstlane, erasing the original pseudo and hence sabotaging the WWM region marking which is supposed to be performed by SIWholeQuadMode. Fix this by handling it more like INSERT_SUBREG, PHI and REG_SEQUENCE: if the source is a VGPR then move the result to a VGPR, and keep the pseudo.	2025-05-30 16:32:56 +01:00
Daniil Fukalov	5208f722d8	[AMDGPU] Fix SIFoldOperandsImpl::canUseImmWithOpSel() for VOP3 packed [B]F16 imms. (#142142 ) VOP3 instructions ignore opsel source modifiers, so a constant that contains two different [B]F16 imms cannot be encoded into instruction with an src opsel. E.g. without the fix the following instructions `s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>` `v_cvt_scalef32_pk_fp8_f16 v0, s0, v2` lose `2.0` imm and are folded into `v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0` Fixes SWDEV-531672	2025-05-30 16:38:07 +02:00
LU-JOHN	f88a9a32d9	[AMDGPU] Extend SRA i64 simplification for shift amts in range [33:62] (#138913 ) Extend sra i64 simplification to shift constants in range [33:62]. Shift amounts 32 and 63 were already handled. New testing for shift amts 33 and 62 added in sra.ll. Changes to other test files were to adapt previous test results to this extension. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-05-30 16:21:38 +02:00
Matt Arsenault	f70e920c87	AMDGPU: Directly check if shrink-instructions run is post-RA (#142009 )	2025-05-30 15:52:18 +02:00
Matt Arsenault	a227b26d35	AMDGPU: Fix broken XFAILed test for fat pointer null initializers (#142015 ) This was failing on the buffer fat pointer lowering error in the addrspace(7) case, not the expected asm printer breakage. Also remove the attempt at FileChecking the result, since that is dependent on the actual fix and we want the unexpected pass whenever the assert is fixed.	2025-05-30 07:55:46 +02:00
Matt Arsenault	6b81483e28	AMDGPU: Start using LLVMContext errors in buffer fat pointer lowering (#142014 ) Avoid using report_fatal_error. Many more uses that should be converted in the pass remain.	2025-05-30 07:52:45 +02:00
Shilei Tian	84a69a0f8f	[AMDGPU] Move InferAddressSpacesPass to middle end optimization pipeline (#138604 ) It will run twice in the non-LTO pipeline with `O1` or higher. In LTO post link pipeline, it will be run once with `O2` or higher, since inline and SROA don't run in `O1`.	2025-05-29 17:20:56 -04:00
Matt Arsenault	cc8d253f39	AMDGPU: Handle other fmin flavors in fract combine (#141987 ) Since the input is either known not-nan, or we have explicit use code checking if the input is a nan, any of the 3 is valid to match.	2025-05-29 22:11:01 +02:00
Matt Arsenault	c569248b74	AMDGPU: Add baseline tests for fract combine with other fmin types (#141986 )	2025-05-29 22:05:00 +02:00
Matt Arsenault	3c5c0709e5	AMDGPU: Add missing fract test (#141985 ) This was missing the case where the fcmp condition and select were inverted.	2025-05-29 21:57:58 +02:00
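
Note on f88a9a32d9 (SRA i64 simplification for shift amounts in [33:62]): the arithmetic behind splitting a 64-bit arithmetic shift right into 32-bit halves can be sketched as below. This is only an illustration of the identity, assuming the simplification relies on it; the helper names (`to_signed`, `ashr64_reference`, `ashr64_via_halves`) are hypothetical and do not correspond to LLVM code. For a shift amount of 32 or more, the new low half is the old high half shifted arithmetically by `amt - 32`, and the new high half is the old high half shifted by 31 (all sign bits).

```python
MASK32 = 0xFFFFFFFF


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def ashr64_reference(x, amt):
    """Plain 64-bit arithmetic shift right (Python's >> on ints is arithmetic)."""
    return to_signed(x, 64) >> amt


def ashr64_via_halves(x, amt):
    """Same result computed from the high 32-bit half only, for 32 <= amt <= 63."""
    assert 32 <= amt <= 63
    hi = to_signed((x >> 32) & MASK32, 32)   # signed high half of the input
    new_lo = (hi >> (amt - 32)) & MASK32     # 32-bit arithmetic shift of the high half
    new_hi = (hi >> 31) & MASK32             # sign bit replicated into every position
    return to_signed((new_hi << 32) | new_lo, 64)
```

The low half of the input never contributes once the shift amount reaches 32, which is why only 32-bit shifts of the high half are needed.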


8758 Commits
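
Note on 5208f722d8 (canUseImmWithOpSel for VOP3 packed [B]F16 immediates): the constant `0x40003c00` from that commit message can be decoded into its two half-precision lanes with the sketch below. It only illustrates why the literal holds two different values (so replacing both lanes with `1.0` loses information); it is not the folding logic in `SIFoldOperands`, and the helper names are hypothetical.

```python
import struct


def unpack_v2f16(bits):
    """Split a 32-bit literal into (low half, high half) as IEEE half floats."""
    lo, hi = struct.unpack("<2e", struct.pack("<I", bits))
    return lo, hi


def is_splat(bits):
    """True when both 16-bit lanes carry the same bit pattern."""
    return (bits & 0xFFFF) == ((bits >> 16) & 0xFFFF)
```

For example, `unpack_v2f16(0x40003C00)` yields the pair `(1.0, 2.0)` quoted in the commit message, and `is_splat(0x40003C00)` is false.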