llvm-project

Author	SHA1	Message	Date
Oliver Stannard	184d1783db	[AArch64] PAUTH_PROLOGUE should not be duplicated with PAuthLR (#124775 ) When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of instructions which takes the address of one of those instructions, and uses that address to compute the return address signature. If this is duplicated, there will be two different addresses used in calculating the signature, so the epilogue will only be correct for (at most) one of them. This change also restricts code generation when using v8.3-A return address signing, without PAuthLR. This isn't strictly needed, as duplicating the prologue there would be valid. We could fix this by having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable, but I don't think it's worth adding the extra complexity to a security feature for that. (cherry picked from commit 36b3c43524c8ca86a5050496b8773f07c5ccddff)	2025-01-31 17:55:58 -08:00
Yingwei Zheng	40ca089d99	[CodeGenPrepare] Replace deleted ext instr with the promoted value. (#71058 ) This PR replaces the deleted ext with the promoted value in `AddrMode`. Fixes #70938. (cherry picked from commit 3c6aa04cf4dee65113e2a780b9f90b36bb4c4e04)	2025-01-31 17:49:18 -08:00
David Green	b23297a7f1	[GlobalISel] Do not run verifier after ResetMachineFunctionPass (#124799 ) After we fall back from GlobalISel to SDAG, the verifier gets called, which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs which caches the result of UsesAGPRs. Because we have just fallen-back the function is empty and it incorrectly gets cached to false. This patch makes sure we don't try to run the verifier whilst the function is empty. (cherry picked from commit 66e0498dafbfa7f8fd7deaa88ae62bdf38a12113)	2025-01-31 16:38:42 -08:00
Luke Lau	ff271d04a2	[RISCV][VLOPT] Fix assertion failure across blocks (#124734 ) Whilst adding a cross-block test, I encountered an assertion failure in the second pass where we check the instruction popped off the worklist is a candidate. The leaf instruction %c in this case will be added to the worklist when its VL is VLMAX, but during the first pass it will have its VL reduced to 1. Then in the second pass when it's processed via the worklist, isCandidate will no longer be true due to its VL == 1. This fixes it by moving the VL == 1 check to tryReduceVL, keeping it alongside the other VL check for bailing out early as an optimisation.	2025-01-29 11:00:50 +08:00
yonghong-song	8aae191cb6	[BPF] Remove 'may_goto 0' instructions (#123482 ) Emil Tsalapatis from Meta reported such a case where 'may_goto 0' insn is generated by clang compiler. But 'may_goto 0' insn is actually a no-op so it makes sense to remove that in llvm. The patch is also able to handle the following code pattern ``` ... may_goto 2 may_goto 1 may_goto 0 ... ``` where three may_goto insns can all be removed. --------- Co-authored-by: Yonghong Song <yonghong.song@linux.dev>	2025-01-28 15:19:05 -08:00
Stephen Tozer	822f74a911	[Clang] Cleanup docs and comments relating to -fextend-variable-liveness (#124767 ) This patch contains a number of changes relating to the above flag; primarily it updates comment references to the old flag names, "-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names, "-fextend-variable-liveness[={all,this}]". These changes are all NFC. This patch also removes the explicit -fextend-this-ptr-liveness flag alias, and shortens the help-text for the main flag; these are both changes that were meant to be applied in the initial PR (#110000), but due to some user-error on my part they were not included in the merged commit.	2025-01-28 18:25:32 +00:00
Venkata Ramanaiah Nalamothu	a0b049055d	[RISC-V] Fix incorrect epilogue_begin setting in debug line table (#120623 ) The DwarfDebug.cpp implementation expects the epilogue instructions to have the source location of the last non-debug instruction after which the epilogue instructions are inserted. The epilogue_begin is set on the location of the first FrameDestroy instruction with source line information that has been seen in the epilogue basic block. In the trunk, the RISC-V backend sets the epilogue_begin after the epilogue has actually begun, i.e. after the callee saved register reloads, and the source line information is not set on those reload instructions. This is leading to #120553 where, while debugging, breaking on or single-stepping to the epilogue_begin location causes variables to be read from the wrong place, as the FP has already been restored to the parent frame's FP. To fix that, this patch sets FrameSetup/FrameDestroy flags on the callee saved register spill/reload instructions, which is actually correct. Then RISCVInstrInfo::loadRegFromStackSlot uses the FrameDestroy flag to identify a reload of the callee saved register in the epilogue and copies the source line information from the insert position instruction to that reload instruction. Requires PR #120622 Fixes #120553	2025-01-28 21:03:12 +05:30
Daniil Fukalov	68d90cff58	[AMDGPU][GlobalISel] Fix assert on APInt creation. (#124608 ) Since 3494ee95902cef62f767489802e469c58a13ea04, APInt no longer implicitly truncates values, so it asserts when a large signed value is converted to an (implicitly) unsigned APInt. The change explicitly marks the offset as a signed value.	2025-01-28 15:53:17 +01:00
Stephen Tozer	22687aa97b	[CodeGen] Correctly handle non-standard cases in RemoveLoadsIntoFakeUses (#111551 ) In the RemoveLoadsIntoFakeUses pass, we try to remove loads that are only used by fake uses, as well as the fake use in question. There are two existing errors with the pass however: it incorrectly examines every operand of each FAKE_USE, when only the first is relevant (extra operands will just be "killed" regs assigned by a previous pass), and it ignores cases where the FAKE_USE register is not an exact match for the loaded register, which is incorrect as regalloc may choose to load a wider value than the FAKE_USE required pre-regalloc. This patch fixes both of these cases.	2025-01-28 13:59:41 +00:00
Renat Idrisov	11db7fb09b	[GlobalISel] Catching inconsistencies in load memory, result, and range metadata type (#121247 ) This is a fix for: https://github.com/llvm/llvm-project/issues/97290 Please let me know if that is the right way to address the issue. Thank you! --------- Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-01-28 20:54:34 +07:00
abhishek-kaushik22	015aed18ee	[SelectionDAG] WidenVecOp_INSERT_SUBVECTOR - Replace `INSERT_SUBVECTOR` with series of `INSERT_VECTOR_ELT` (#124420 ) If the operands to `INSERT_SUBVECTOR` can't be widened legally, just replace the `INSERT_SUBVECTOR` with a series of `INSERT_VECTOR_ELT`. Closes #124255 (and possibly #102016)	2025-01-28 18:54:49 +05:30
Luke Lau	500a1834d9	[RISCV][VLOPT] Fix some typos in vl-opt-op-info.mir test. NFC vleN_v_incompatible_emul reassigns to %x and vsuxeiN_v_idx_incompatible_eew has a dead instruction	2025-01-28 20:28:02 +08:00
Pierre van Houtryve	8ea018ce1d	[DAGISel] Fix MMRA Handling in copyExtraInfo (#124730 ) #78569 did not implement this correctly and an edge case breaks it by triggering `Assertion `!Leafs.empty()' failed.` Fixes SWDEV-507698	2025-01-28 13:27:26 +01:00
Cullen Rhodes	8017ca1d00	Reapply "[AArch64] Combine and and lsl into ubfiz" (#123356 ) (#124576 ) Patch was reverted due to test case (added) exposing an infinite loop in combiner, where (shl C1, C2) created by performSHLCombine isn't constant-folded: Combining: t14: i64 = shl t12, Constant:i64<1> Creating new node: t36: i64 = shl OpaqueConstant:i64<-2401053089408754003>, Constant:i64<1> Creating new node: t37: i64 = shl t6, Constant:i64<1> Creating new node: t38: i64 = and t37, t36 ... into: t38: i64 = and t37, t36 ... Combining: t38: i64 = and t37, t36 Creating new node: t39: i64 = and t6, OpaqueConstant:i64<-2401053089408754003> Creating new node: t40: i64 = shl t39, Constant:i64<1> ... into: t40: i64 = shl t39, Constant:i64<1> and subsequently gets simplified by DAGCombiner::visitAND: // Simplify: (and (op x...), (op y...)) -> (op (and x, y)) if (N0.getOpcode() == N1.getOpcode()) if (SDValue V = hoistLogicOpWithSameOpcodeHands(N)) return V; before being folded by performSHLCombine once again and so on. The combine in performSHLCombine should only be done if (shl C1, C2) can be constant-folded, it may otherwise be unsafe and generally have a worse end result. Thanks to Dave Sherwood for his insight on this one. This reverts commit f719771f251d7c30eca448133fe85730f19a6bd1.	2025-01-28 11:27:34 +00:00
Csanád Hajdú	4a00c84fbb	[AArch64] Allow register offset addressing mode for prefetch (#124534 ) Previously instruction selection failed to generate PRFM instructions with register offsets because `AArch64ISD::PREFETCH` is not a `MemSDNode`.	2025-01-28 09:16:40 +00:00
Aaditya	cd57c9530b	[NFC][AMDGPU] Autogenerating test cases (#124507 )	2025-01-28 13:41:59 +05:30
Adam Yang	aab25f20f6	[HLSL][SPIRV][DXIL] Implement `WaveActiveMax` intrinsic (#123428 ) ``` - add clang builtin to Builtins.td - link builtin in hlsl_intrinsics - add codegen for spirv intrinsic and two directx intrinsics to retain signedness information of the operands in CGBuiltin.cpp - add semantic analysis in SemaHLSL.cpp - add lowering of spirv intrinsic to spirv backend in SPIRVInstructionSelector.cpp - add lowering of directx intrinsics to WaveActiveOp dxil op in DXIL.td - add test cases to illustrate passespendent pr merges. ``` Resolves #99170	2025-01-27 23:26:56 -08:00
Djordje Todorovic	0cb7636a46	[RISCV] Add MIPS extensions (#121394 ) Adding two extensions for MIPS p8700 CPU: 1. cmove (conditional move) 2. lsp (load/store pair) The official product page here: https://mips.com/products/hardware/p8700	2025-01-28 08:04:09 +01:00
Craig Topper	d4af658323	[RISCV] Support multiple memory operands in expandRV32ZdinxStore. TailMerge can create stores with multiple memory operands. We need to split all of them instead of assuming there is only one.	2025-01-27 22:10:51 -08:00
quic_hchandel	2d0688797c	[RISCV] Renaming muladdi to muliadd as per v0.5 spec. (#124237 ) muliadd is more relevant to the operation performed, i.e. multiply by immediate. The latest spec can be found at: https://github.com/quic/riscv-unified-db/releases/latest	2025-01-27 20:40:45 -08:00
Shilei Tian	6e4105574e	[NFC][AMDGPU] Improve code introduced in #124607 (#124672 )	2025-01-27 22:57:16 -05:00
Momchil Velikov	f75860f895	[AArch64] Implement NEON FP8 intrinsics for fused multiply-add (#123615 ) This patch adds the following intrinsics: * Fused multiply-add non-indexed float16x8_t vmlalbq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float16x8_t vmlaltq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlallbbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlallbtq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlalltbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlallttq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) * Floating-point multiply-add long to half-precision (vector, by element) float16x8_t vmlalbq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vmlalbq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vmlaltq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vmlaltq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) * Floating-point multiply-add long-long to single-precision (vector, by element) float32x4_t vmlallbbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallbbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallbtq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallbtq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)	2025-01-28 00:38:44 +00:00
Shilei Tian	3b2b7ec07d	[AMDGPU] Handle invariant marks in `AMDGPUPromoteAllocaPass` (#124607 ) Fixes SWDEV-509327.	2025-01-27 17:30:50 -05:00
David Green	5a81a559d6	[GISel] Explicitly disable BF16 tablegen patterns. (#124113 ) We currently have an issue where bf16 patterns can be used to match fp16 types, as GISel does not know about the difference between the two. This patch explicitly disables them to make sure that they are never used. The opposite can also happen, where fp16 patterns are used for operators that should be bf16. So this also changes any operations with bf16 types to now cause a fallback to SDAG. The pass setup for GISel has been slightly adjusted to make sure that a verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass, which otherwise can cause verifier issues when falling back.	2025-01-27 22:21:12 +00:00
Momchil Velikov	804b81d39f	[AArch64] Add FP8 Neon intrinsics for dot-product (#123613 ) This patch adds the following intrinsics: float16x4_t vdot_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t vm, fpm_t fpm) float16x8_t vdotq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) float16x4_t vdot_lane_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x4_t vdot_laneq_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vdotq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vdotq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x2_t vdot_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t vm, fpm_t fpm) float32x4_t vdotq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) float32x2_t vdot_lane_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x2_t vdot_laneq_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vdotq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vdotq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)	2025-01-27 21:14:16 +00:00
Craig Topper	aa34a6ab29	[RISCV] Add register allocation hints for lui/auipc+addi fusion. (#123860 ) Spotted the auipc case while looking at code for P550. I'm not sure this is the right long term fix. We're still missing rematerialization opportunities for these pairs so a pseudo might be better. That would interfere with folding auipc+add into load/store addressing though. Fixes #76779.	2025-01-27 11:16:22 -08:00
Heejin Ahn	539b2e0654	[WebAssembly] Fix catch block type in wasm64 (#124381 ) `try_table`'s `catch` or `catch_ref`'s target block's return type should be `i64` and `(i64, exnref)` in case of wasm64.	2025-01-27 11:01:48 -08:00
Jeffrey Byrnes	e77d428e46	[AMDGPU] Do not remat instructions with PhysReg uses (#124366 ) This blocks rematerialization during scheduling if the instruction has a non accepted PhysReg use. Currently, there aren't any checks like this in place, and we may create invalid code: https://godbolt.org/z/xjPjdcorf	2025-01-27 10:50:06 -08:00
Brox Chen	d1139b32d2	[AMDGPU][True16][CodeGen] true16 codegen pats for v_mad_u16 (#124000 ) true16 codegen pats for v_mad_u16 (mul+add)	2025-01-27 13:47:17 -05:00
Momchil Velikov	5d6d982df6	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (11/11) (#116837 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `SXTB`, `UXTB`, `SXTH`, `UXTH`, `SXTW`, and `UXTW` instructions.	2025-01-27 18:12:00 +00:00
Momchil Velikov	99bd2e3f12	[AArch64] Add Neon FP8 conversion intrinsics (#123612 ) The patch adds the following intrinsics: bfloat16x8_t vcvt1_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) bfloat16x8_t vcvt1_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) bfloat16x8_t vcvt2_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) bfloat16x8_t vcvt2_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) bfloat16x8_t vcvt1_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) bfloat16x8_t vcvt2_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt1_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) float16x8_t vcvt1_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt2_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) float16x8_t vcvt2_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt1_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt2_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) mfloat8x8_t vcvt_mf8_f32_fpm(float32x4_t vn, float32x4_t vm, fpm_t fpm) mfloat8x16_t vcvt_high_mf8_f32_fpm(mfloat8x8_t vd, float32x4_t vn, float32x4_t vm, fpm_t fpm) mfloat8x8_t vcvt_mf8_f16_fpm(float16x4_t vn, float16x4_t vm, fpm_t fpm) mfloat8x16_t vcvtq_mf8_f16_fpm(float16x8_t vn, float16x8_t vm, fpm_t fpm) Co-Authored-By: Caroline Concatto <caroline.concatto@arm.com>	2025-01-27 17:32:47 +00:00
Momchil Velikov	4e231014c1	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (10/11) (#116836 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `RBIT`, `REVB`, `REVH`, `REVW`, and `REVD` instructions.	2025-01-27 16:45:40 +00:00
Luke Lau	cb6f021af2	[RISCV][VLOPT] Remove unnecessary passthru restriction (#124549 ) We currently check for passthrus in two places, on the instruction to reduce in isCandidate, and on the users in checkUsers. We cannot reduce the VL if an instruction has a user that's a passthru, because the user will read elements past VL in the tail. However it's fine to reduce an instruction if it itself contains a non-undef passthru. Since the VL can only be reduced, not increased, the previous tail will always remain the same.	2025-01-27 23:54:32 +08:00
Momchil Velikov	f95f10c7e6	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (9/11) (#116835 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `URECPE`, `URSQRTE`, `SQABS` and `SQNEG` instructions.	2025-01-27 15:50:53 +00:00
Simon Pilgrim	86705eb624	[X86] huge-stack-offset.ll - add gnux32 test coverage This should match x86 for the basic implementation, but it's useful to check it actually runs correctly.	2025-01-27 14:10:16 +00:00
David Green	ef54e0bbfb	[AArch64] Avoid generating LDAPUR on certain cores (#124274 ) On the CPUs listed below, we want to avoid LDAPUR for performance reasons. Add a tuning feature to disable them when using: -mcpu=neoverse-v2 -mcpu=neoverse-v3 -mcpu=cortex-x3 -mcpu=cortex-x4 -mcpu=cortex-x925	2025-01-27 13:12:11 +00:00
Momchil Velikov	d8ad1eef8f	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (7/11) (#116833 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `FLOGB` instructions.	2025-01-27 12:53:38 +00:00
Durgadoss R	3b5e9eed2f	[NVPTX] Add float to tf32 conversion intrinsics (#124316 ) This patch adds the set of f32 -> tf32 cvt intrinsics introduced in sm100 with ptx8.6. This builds on top of the recent PR #121507. Tests are verified with a 12.8 ptxas executable. PTX ISA link: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-01-27 15:52:43 +05:30
Momchil Velikov	bd38c4993a	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (8/11) (#116834 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `FRINTx`, `FRECPX`, and `FSQRT` instructions.	2025-01-27 09:21:56 +00:00
YunQiang Su	bfa7de0df5	X86: Support FCANONICALIZE on f64/f80 for i686 with SSE2 or AVX (#123917 ) Currently, FCANONICALIZE is not enabled for f64 with SSE2, and is not enabled for f80 for 32bit system. Let's enable them.	2025-01-27 07:27:26 +08:00
mconst	1f26ac10ca	[X86] Better handling of impossibly large stack frames (#124217 ) If you try to create a stack frame of 4 GiB or larger with a 32-bit stack pointer, we currently emit invalid instructions like `mov eax, 5000000000` (unless you specify `-fstack-clash-protection`, in which case we emit a trap instead). The trap seems nicer, so let's do that in all cases. This avoids emitting invalid instructions, and also fixes the "can't have 32-bit 16GB stack frame" assertion in `X86FrameLowering::emitSPUpdate()` (which used to be triggerable by user code, but is now correct). This was originally part of #124041. @phoebewang	2025-01-25 12:03:57 +08:00
Hiroshi Yamauchi	425d25f5df	[AArch64][WinCFI] Fix a crash due to missing seh directives (#123993 ) https://github.com/llvm/llvm-project/issues/123808	2025-01-24 14:01:41 -08:00
Philip Reames	a9ad601f7c	[RISCV] Use vrsub for select of add and sub of the same operands (#123400 ) If we have a (vselect c, a+b, a-b), we can combine this to a+(vselect c, b, -b). That by itself isn't hugely profitable, but if we reverse the select, we get a form which matches a masked vrsub.vi with zero. The result is that we can use a masked vrsub before the add instead of a masked add or sub. This doesn't change the critical path (since we already had the pass through on the masked second op), but does reduce register pressure since a, b, and (a+b) don't need to all be alive at once. In addition to the vselect form, we can also see the same pattern with a vector_shuffle encoding the vselect. I explored canonicalizing these to vselects instead, but that exposes several unrelated missing combines.	2025-01-24 10:08:42 -08:00
Brox Chen	ec66c4af09	[AMDGPU][True16][CodeGen] true16 codegen pattern for f16 canonicalize (#122000 ) true16 codegen pattern for f16 canonicalize	2025-01-24 10:44:00 -05:00
Nikita Popov	2068b1ba03	[X86] Fix ABI for passing after i128 (#124134 ) If we're passing an i128 value and we no longer have enough argument registers (only r9 unallocated), the value gets passed via the stack. However, r9 is still allocated as a shadow register, which means that a following i64 argument will not use it. This doesn't match the x86-64 psABI. Fix this by making i128 arguments as requiring consecutive registers, and then adding a custom CC lowering that will allocate both parts of the i128 at the same time, either to register or to stack, without reserving a shadow register. Fixes https://github.com/llvm/llvm-project/issues/123935.	2025-01-24 15:31:53 +01:00
Aaditya	11b0401926	[AMDGPU] Restore SP from saved-FP or saved-BP (#124007 ) Currently, the AMDGPU backend bumps the Stack Pointer by fixed size offsets in the prolog of device functions, and restores it by the same amount in the epilog. Prolog: sp += frameSize Epilog: sp -= frameSize If a function has dynamic stack realignment, Prolog: sp += frameSize + max_alignment Epilog: sp -= frameSize + max_alignment These calculations are not optimal in case of dynamic stack realignment, and completely fail in case of dynamic stack readjustment. This patch uses the saved Frame Pointer to restore SP. Prolog: fp = sp sp += frameSize Epilog: sp = fp In case of dynamic stack realignment, SP is restored from the saved Base Pointer. Prolog: fp = sp + (max_alignment - 1) fp = fp & (-max_alignment) bp = sp sp += frameSize + max_alignment Epilog: sp = bp (Note: The presence of BP has been enforced in case of any dynamic stack realignment.) --------- Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-01-24 19:13:40 +05:30
yingopq	d6e0798a2a	[Mips] Add the missing judgment when processing function handleMFLOSlot (#121463 ) In function handleMFLOSlot, we may get a variable LastInstInFunction with a value of true from function getNextMachineInstr and IInSlot may be null which would trigger an assert. So we need to skip this case. Fix #118223.	2025-01-24 20:03:03 +08:00
Petar Avramovic	b60c118f53	MachineUniformityAnalysis: Improve isConstantOrUndefValuePhi (#112866 ) Change existing code for G_PHI to match what LLVM-IR version is doing via PHINode::hasConstantOrUndefValue. This is not safe for regular PHI since it may appear with an undef operand and getVRegDef can fail. Most notably this improves number of values that can be allocated to sgpr in AMDGPURegBankSelect. Common case here are phis that appear in structurize-cfg lowering for cycles with multiple exits: Undef incoming value is coming from block that reached cycle exit condition, if other incoming is uniform keep the phi uniform despite the fact it is joining values from pair of blocks that are entered via divergent condition branch.	2025-01-24 12:43:40 +01:00
Petar Avramovic	4831fa8632	AMDGPU/GlobalISel: RegBankLegalize rules for load (#112882 ) Add IDs for bit width that cover multiple LLTs: B32 B64 etc. "Predicate" wrapper class for bool predicate functions used to write pretty rules. Predicates can be combined using &&, || and !. Lowering for splitting and widening loads. Write rules for loads to not change existing mir tests from old regbankselect.	2025-01-24 12:36:41 +01:00
Petar Avramovic	0ee037b861	AMDGPU/GlobalISel: AMDGPURegBankLegalize (#112864 ) Lower G_ instructions that can't be inst-selected with register bank assignment from AMDGPURegBankSelect based on uniformity analysis. - Lower instruction to perform it on assigned register bank - Put uniform value in vgpr because SALU instruction is not available - Execute divergent instruction in SALU - "waterfall loop" Given LLTs on all operands after legalizer, some register bank assignments require lowering while others do not. Note: cases where all register bank assignments would require lowering are lowered in legalizer. AMDGPURegBankLegalize goals: - Define Rules: when and how to perform lowering - Goal of defining Rules is to provide high level table-like brief overview of how to lower generic instructions based on available target features and uniformity info (uniform vs divergent). - Fast search of Rules, depends on how complicated Rule.Predicate is - For some opcodes there would be too many Rules that are essentially all the same just for different combinations of types and banks. Write custom function that handles all cases. - Rules are made from enum IDs that correspond to each operand. Names of IDs are meant to give brief description what lowering does for each operand or the whole instruction. - AMDGPURegBankLegalizeHelper implements lowering algorithms Since this is the first patch that actually enables -new-reg-bank-select here is the summary of regression tests that were added earlier: - if instruction is uniform always select SALU instruction if available - eliminate back to back vgpr to sgpr to vgpr copies of uniform values - fast rules: small differences for standard and vector instruction - enabling Rule based on target feature - salu_float - how to specify lowering algorithm - vgpr S64 AND to S32 - on G_TRUNC in reg, it is up to user to deal with truncated bits G_TRUNC in reg is treated as no-op. - dealing with truncated high bits - ABS S16 to S32 - sgpr S1 phi lowering - new opcodes for vcc-to-scc and scc-to-vcc copies - lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC) - S1 zext and sext lowering to select - uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into SALU instruction - divergent phi with uniform inputs - divergent instruction with temporal divergent use, source instruction is defined as uniform(AMDGPURegBankSelect) - missing temporal divergence lowering - uniform phi, because of undef incoming, is assigned to vgpr. Will be fixed in AMDGPURegBankSelect via another fix in machine uniformity analysis.	2025-01-24 12:12:45 +01:00
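
Note on 11b0401926 ([AMDGPU] Restore SP from saved-FP or saved-BP): the prolog/epilog arithmetic in that message can be modelled with plain integers. The following is only a sketch under assumptions, not the backend's code: the function names and the `dynamic_alloc` parameter (standing in for any dynamic stack adjustment made in the function body) are hypothetical, and the saving and restoring of the caller's FP/BP is left out.

```python
def fixed_offset_epilog(sp, frame_size, dynamic_alloc):
    """Old scheme: the epilog subtracts the same fixed amount the prolog added."""
    entry_sp = sp
    sp += frame_size       # Prolog: sp += frameSize
    sp += dynamic_alloc    # hypothetical dynamic stack adjustment in the body
    sp -= frame_size       # Epilog: sp -= frameSize
    return sp == entry_sp  # True only if SP is back at its entry value


def restore_from_fp(sp, frame_size, dynamic_alloc):
    """New scheme without realignment: SP is restored from the frame pointer."""
    entry_sp = sp
    fp = sp                # Prolog: fp = sp
    sp += frame_size       #         sp += frameSize
    sp += dynamic_alloc    # hypothetical dynamic stack adjustment in the body
    sp = fp                # Epilog: sp = fp
    return sp == entry_sp


def restore_from_bp(sp, frame_size, max_alignment, dynamic_alloc):
    """New scheme with dynamic realignment: SP is restored from the base pointer."""
    entry_sp = sp
    fp = (sp + (max_alignment - 1)) & -max_alignment  # aligned frame pointer
    bp = sp                                           # base pointer keeps entry SP
    sp += frame_size + max_alignment
    sp += dynamic_alloc    # hypothetical dynamic stack adjustment in the body
    sp = bp                # Epilog: sp = bp
    return sp == entry_sp, fp % max_alignment == 0
```

With `dynamic_alloc == 0` all three variants return SP to its entry value. With a non-zero `dynamic_alloc` only the FP/BP-based variants do, which is the failure mode the message describes.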

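
Note on a9ad601f7c ([RISCV] Use vrsub for select of add and sub of the same operands): the rewrite from (vselect c, a+b, a-b) to a+(vselect c, b, -b) can be checked element-wise. This is a sketch under assumptions: vectors are modelled as Python lists of unbounded integers, and `masked_vrsub_zero` is a hypothetical helper that models a masked reverse-subtract from zero with a passthru operand; it is not an LLVM or RVV API.

```python
def select_add_sub(c, a, b):
    """Original form: (vselect c, a+b, a-b), element by element."""
    return [x + y if m else x - y for m, x, y in zip(c, a, b)]


def masked_vrsub_zero(mask, src, passthru):
    """Model of a masked reverse-subtract from zero: active lanes get 0 - src,
    inactive lanes keep the passthru value."""
    return [0 - s if m else p for m, s, p in zip(mask, src, passthru)]


def add_of_masked_negate(c, a, b):
    """Rewritten form: negate b only where c is false, then do one plain add."""
    not_c = [not m for m in c]            # the select is reversed
    t = masked_vrsub_zero(not_c, b, b)    # b where c is true, -b where c is false
    return [x + y for x, y in zip(a, t)]


c = [True, False, True, False]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(select_add_sub(c, a, b) == add_of_masked_negate(c, a, b))
```

Only `a` and the conditionally negated `b` need to be live before the final add, which is the register-pressure argument made in the message.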

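
Note on 8017ca1d00 (Reapply "[AArch64] Combine and and lsl into ubfiz"): the two rewrites that fed each other in the reported loop are value-preserving inverses of one another, which is why nothing stopped the combiner from alternating between them. The sketch below only checks that equivalence on 64-bit values; the helper names are hypothetical and the code does not model SelectionDAG nodes or opaque constants.

```python
MASK64 = (1 << 64) - 1  # model i64 arithmetic by truncating to 64 bits


def shl(x, s):
    """Logical shift left on a 64-bit value."""
    return (x << s) & MASK64


def and_of_shls(x, c, s):
    """Form produced by the shift combine: (and (shl x, s), (shl c, s))."""
    return shl(x, s) & shl(c, s)


def shl_of_and(x, c, s):
    """Form produced by hoisting the common opcode: (shl (and x, c), s)."""
    return shl(x & c, s)


# The constant from the combiner trace in the message, as an unsigned 64-bit value.
C = -2401053089408754003 & MASK64

for x in (0, 1, 0xDEADBEEF, MASK64):
    assert and_of_shls(x, C, 1) == shl_of_and(x, C, 1)
```

Because both forms always compute the same value, the only thing that breaks the cycle is a guard such as the one the message describes: perform the shift combine only when (shl C1, C2) actually constant-folds.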
57164 Commits