llvm-project

Author	SHA1	Message	Date
Matt Arsenault	1cbfac04d0	SystemZ: Handle copies between gr64 and fp64 (#124890 ) I'm guessing based on tablegen definitions. I also don't really understand how this could have been missing. This defends against regressions in a future peephole-opt patch.	2025-01-30 11:08:08 +07:00
Matt Arsenault	6017480461	MachineVerifier: Fix check for range type (#124894 ) We need to permit scalar extending loads with range annotations. Fix expensive_checks failures after 11db7fb09b36e656a801117d6a2492133e9c2e46	2025-01-30 10:56:12 +07:00
Matt Arsenault	97a1f494a6	DAG: Avoid breaking legal vector_shuffle with multiple uses (#123712 ) Previously this combine would undo AMDGPU's new custom legalization of wide vector shuffles into 2 element pieces. The comment also states that this combine is only done before legalization, but the case with a build_vector source was unconditional. We probably don't want to do this if the multiple uses are full scalarization of the vector, but this seems to work well enough. Scalarizing extracts should have folded out pre-legalize.	2025-01-30 10:55:21 +07:00
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Yingwei Zheng	3c6aa04cf4	[CodeGenPrepare] Replace deleted ext instr with the promoted value. (#71058 ) This PR replaces the deleted ext with the promoted value in `AddrMode`. Fixes #70938.	2025-01-30 08:58:23 +08:00
Alex MacLean	de7438e472	[NVPTX] Auto-Upgrade some nvvm.annotations to attributes (#119261 ) Add a new AutoUpgrade function to convert some legacy nvvm.annotations metadata to function level attributes. These attributes are quicker to look-up so improve compile time and are more idiomatic than using metadata which should not include required information that changes the meaning of the program. Currently supported annotations are: - !"kernel" -> ptx_kernel calling convention - !"align" -> alignstack parameter attributes (return not yet supported)	2025-01-29 16:27:27 -08:00
Konstantina Mitropoulou	9adc99bcc5	[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. (#124028 ) - [NFC] Use GCNPat instead of Pat. - [AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. --------- Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>	2025-01-29 09:00:40 -08:00
Nikita Popov	29441e4f5f	[IR] Convert from nocapture to captures(none) (#123181 ) This PR removes the old `nocapture` attribute, replacing it with the new `captures` attribute introduced in #116990. This change is intended to be essentially NFC, replacing existing uses of `nocapture` with `captures(none)` without adding any new analysis capabilities. Making use of non-`none` values is left for a followup. Some notes: * `nocapture` will be upgraded to `captures(none)` by the bitcode reader. * `nocapture` will also be upgraded by the textual IR reader. This is to make it easier to use old IR files and somewhat reduce the test churn in this PR. * Helper APIs like `doesNotCapture()` will check for `captures(none)`. * MLIR import will convert `captures(none)` into an `llvm.nocapture` attribute. The representation in the LLVM IR dialect should be updated separately.	2025-01-29 16:56:47 +01:00
Mikhail Gudim	3c3c850a45	[ReachingDefAnalysis] Extend the analysis to stack objects. (#118097 ) We track definitions of stack objects, the implementation is identical to tracking of registers. Also, added printing of all found reaching definitions for testing purposes. --------- Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>	2025-01-29 10:55:16 -05:00
Acim Maravic	3a29dfe37c	[LLVM][AMDGPU] Add Intrinsic and Builtin for ds_bpermute_fi_b32 (#124616 )	2025-01-29 14:04:10 +01:00
David Green	66e0498daf	[GlobalISel] Do not run verifier after ResetMachineFunctionPass (#124799 ) After we fall back from GlobalISel to SDAG, the verifier gets called, which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs which caches the result of UsesAGPRs. Because we have just fallen-back the function is empty and it incorrectly gets cached to false. This patch makes sure we don't try to run the verifier whilst the function is empty.	2025-01-29 12:48:11 +00:00
Simon Pilgrim	9534d27e33	[X86] vector-idiv-sdiv-512.ll - regenerate VPTERNLOG comments	2025-01-29 11:34:44 +00:00
Oliver Stannard	36b3c43524	[AArch64] PAUTH_PROLOGUE should not be duplicated with PAuthLR (#124775 ) When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of instructions which takes the address of one of those instructions, and uses that address to compute the return address signature. If this is duplicated, there will be two different addresses used in calculating the signature, so the epilogue will only be correct for (at most) one of them. This change also restricts code generation when using v8.3-A return address signing, without PAuthLR. This isn't strictly needed, as duplicating the prologue there would be valid. We could fix this by having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable, but I don't think it's worth adding the extra complexity to a security feature for that.	2025-01-29 10:42:47 +00:00
Mingming Liu	3feb724496	[AsmPrinter][ELF] Support profile-guided section prefix for jump tables' (read-only) data sections (#122215 ) https://github.com/llvm/llvm-project/pull/122183 adds a codegen pass to infer machine jump table entry's hotness from the MBB hotness. This is a follow-up PR to produce `.hot` and/or `.unlikely` section prefix for jump table's (read-only) data sections in the relocatable `.o` files. When this patch is enabled, the linker will see {`.rodata`, `.rodata.hot`, `.rodata.unlikely`} in input sections. It can map `.rodata.hot` and `.rodata` in the input sections to `.rodata.hot` in the executable, and map `.rodata.unlikely` into `.rodata` with a pending extension to `--keep-text-section-prefix` like `059e7cbb66`, or with a linker script. 1. To partition hot and cold jump tables, the AsmPrinter pass slices a function's jump table indices into two groups, one for hot and the other for cold jump tables. It then emits hot jump tables into a `.hot`-prefixed data section and cold ones into a `.unlikely`-prefixed data section, retaining the relative order of `LJT<N>` labels within each group. 2. [ELF only] To have data sections with _dynamic_ names (e.g., `.rodata.hot[.func]`), we implement `TargetLoweringObjectFile::getSectionForJumpTable` method that accepts a `MachineJumpTableEntry` parameter, and update `selectELFSectionForGlobal` to generate `.hot` or `.unlikely` based on MJTE's hotness. - The dynamic JT section name doesn't depend on `-ffunction-section=true` or `-funique-section-names=true`, even though it leverages the similar underlying mechanism to have a MCSection with on-demand name as `-ffunction-section` does. 3. The new code path is off by default. - Typically, `TargetOptions` conveys clang or LLVM tools' options to code generation passes. To follow the pattern, add option `EnableStaticDataPartitioning` bit in `TargetOptions` and make it readable through `TargetMachine`. - To enable the new code path in tools like `llc`, `partition-static-data-sections` option is introduced in `CodeGen/CommandFlags.h/cpp`. - A subsequent patch ([draft](`8f36a13743`)) will add a clang option to enable the new code path. --------- Co-authored-by: Ellis Hoag <ellis.sparky.hoag@gmail.com>	2025-01-28 22:49:28 -08:00
Luke Lau	8675cd3fac	[RISCV][VLOPT] Compute demanded VLs up front (#124530 ) This replaces the worklist by instead computing what VL is demanded by each instruction's users first, which is done via checkUsers. The demanded VLs are stored in a DenseMap, and then we can just do a single forward pass of tryReduceVL where we check if a candidate's demanded VL is less than its VLOp. This means the pass should now be linear in complexity, and allows us to relax the restriction on tied operands more easily, as in #124066.	2025-01-29 12:39:38 +08:00
Luke Lau	ff271d04a2	[RISCV][VLOPT] Fix assertion failure across blocks (#124734 ) Whilst adding a cross-block test, I encountered an assertion failure in the second pass where we check the instruction popped off the worklist is a candidate. The leaf instruction %c in this case will be added to the worklist when its VL is VLMAX, but during the first pass it will have its VL reduced to 1. Then in the second pass, when it is processed via the worklist, isCandidate will no longer be true due to its VL == 1. This fixes it by moving the VL == 1 check to tryReduceVL, keeping it alongside the other VL check for bailing out early as an optimisation.	2025-01-29 11:00:50 +08:00
yonghong-song	8aae191cb6	[BPF] Remove 'may_goto 0' instructions (#123482 ) Emil Tsalapatis from Meta reported such a case where 'may_goto 0' insn is generated by clang compiler. But 'may_goto 0' insn is actually a no-op so it makes sense to remove that in llvm. The patch is also able to handle the following code pattern ``` ... may_goto 2 may_goto 1 may_goto 0 ... ``` where three may_goto insns can all be removed. --------- Co-authored-by: Yonghong Song <yonghong.song@linux.dev>	2025-01-28 15:19:05 -08:00
Stephen Tozer	822f74a911	[Clang] Cleanup docs and comments relating to -fextend-variable-liveness (#124767 ) This patch contains a number of changes relating to the above flag; primarily it updates comment references to the old flag names, "-fextend-lifetimes" and "-fextend-this-ptr" to refer to the new names, "-fextend-variable-liveness[={all,this}]". These changes are all NFC. This patch also removes the explicit -fextend-this-ptr-liveness flag alias, and shortens the help-text for the main flag; these are both changes that were meant to be applied in the initial PR (#110000), but due to some user-error on my part they were not included in the merged commit.	2025-01-28 18:25:32 +00:00
Venkata Ramanaiah Nalamothu	a0b049055d	[RISC-V] Fix incorrect epilogue_begin setting in debug line table (#120623 ) The DwarfDebug.cpp implementation expects the epilogue instructions to have source location of last non-debug instruction after which the epilogue instructions are inserted. The epilogue_begin is set on location of the first FrameDestroy instruction with source line information that has been seen in the epilogue basic block. In the trunk, the risc-v backend sets the epilogue_begin after the epilogue has actually begun i.e. after callee saved register reloads and the source line information is not set on those reload instructions. This is leading to #120553 where, while debugging, breaking on or single stepping to the epilogue_begin location will make accessing the variables from wrong place as the FP has been restored to the parent frame's FP. To fix that, this patch sets FrameSetup/FrameDestroy flags on the callee saved register spill/reload instructions which is actually correct. Then the RISCVInstrInfo::loadRegFromStackSlot uses FrameDestroy flag to identify a reload of the callee saved register in the epilogue and copies the source line information from insert position instruction to that reload instruction. Requires PR #120622 Fixes #120553	2025-01-28 21:03:12 +05:30
Daniil Fukalov	68d90cff58	[AMDGPU][GlobalISel] Fix assert on APInt creation. (#124608 ) Since 3494ee95902cef62f767489802e469c58a13ea04, APInt no longer implicitly truncates values, so it asserts when a big signed value is converted to an (implicitly) unsigned APInt. The change explicitly marks offset as a signed value.	2025-01-28 15:53:17 +01:00
Stephen Tozer	22687aa97b	[CodeGen] Correctly handle non-standard cases in RemoveLoadsIntoFakeUses (#111551 ) In the RemoveLoadsIntoFakeUses pass, we try to remove loads that are only used by fake uses, as well as the fake use in question. There are two existing errors with the pass however: it incorrectly examines every operand of each FAKE_USE, when only the first is relevant (extra operands will just be "killed" regs assigned by a previous pass), and it ignores cases where the FAKE_USE register is not an exact match for the loaded register, which is incorrect as regalloc may choose to load a wider value than the FAKE_USE required pre-regalloc. This patch fixes both of these cases.	2025-01-28 13:59:41 +00:00
Renat Idrisov	11db7fb09b	[GlobalISel] Catching inconsistencies in load memory, result, and range metadata type (#121247 ) This is a fix for: https://github.com/llvm/llvm-project/issues/97290 Please let me know if that is the right way to address the issue. Thank you! --------- Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-01-28 20:54:34 +07:00
abhishek-kaushik22	015aed18ee	[SelectionDAG] WidenVecOp_INSERT_SUBVECTOR - Replace `INSERT_SUBVECTOR` with series of `INSERT_VECTOR_ELT` (#124420 ) If the operands to `INSERT_SUBVECTOR` can't be widened legally, just replace the `INSERT_SUBVECTOR` with a series of `INSERT_VECTOR_ELT`. Closes #124255 (and possibly #102016)	2025-01-28 18:54:49 +05:30
Luke Lau	500a1834d9	[RISCV][VLOPT] Fix some typos in vl-opt-op-info.mir test. NFC vleN_v_incompatible_emul reassigns to %x and vsuxeiN_v_idx_incompatible_eew has a dead instruction	2025-01-28 20:28:02 +08:00
Pierre van Houtryve	8ea018ce1d	[DAGISel] Fix MMRA Handling in copyExtraInfo (#124730 ) #78569 did not implement this correctly and an edge case breaks it by triggering `Assertion `!Leafs.empty()' failed.` Fixes SWDEV-507698	2025-01-28 13:27:26 +01:00
Cullen Rhodes	8017ca1d00	Reapply "[AArch64] Combine and and lsl into ubfiz" (#123356 ) (#124576 ) Patch was reverted due to test case (added) exposing an infinite loop in the combiner, where (shl C1, C2) created by performSHLCombine isn't constant-folded: Combining: t14: i64 = shl t12, Constant:i64<1> Creating new node: t36: i64 = shl OpaqueConstant:i64<-2401053089408754003>, Constant:i64<1> Creating new node: t37: i64 = shl t6, Constant:i64<1> Creating new node: t38: i64 = and t37, t36 ... into: t38: i64 = and t37, t36 ... Combining: t38: i64 = and t37, t36 Creating new node: t39: i64 = and t6, OpaqueConstant:i64<-2401053089408754003> Creating new node: t40: i64 = shl t39, Constant:i64<1> ... into: t40: i64 = shl t39, Constant:i64<1> and subsequently gets simplified by DAGCombiner::visitAND: // Simplify: (and (op x...), (op y...)) -> (op (and x, y)) if (N0.getOpcode() == N1.getOpcode()) if (SDValue V = hoistLogicOpWithSameOpcodeHands(N)) return V; before being folded by performSHLCombine once again and so on. The combine in performSHLCombine should only be done if (shl C1, C2) can be constant-folded, it may otherwise be unsafe and generally have a worse end result. Thanks to Dave Sherwood for his insight on this one. This reverts commit f719771f251d7c30eca448133fe85730f19a6bd1.	2025-01-28 11:27:34 +00:00
Csanád Hajdú	4a00c84fbb	[AArch64] Allow register offset addressing mode for prefetch (#124534 ) Previously instruction selection failed to generate PRFM instructions with register offsets because `AArch64ISD::PREFETCH` is not a `MemSDNode`.	2025-01-28 09:16:40 +00:00
Aaditya	cd57c9530b	[NFC][AMDGPU] Autogenerating test cases (#124507 )	2025-01-28 13:41:59 +05:30
Adam Yang	aab25f20f6	[HLSL][SPIRV][DXIL] Implement `WaveActiveMax` intrinsic (#123428 ) ``` - add clang builtin to Builtins.td - link builtin in hlsl_intrinsics - add codegen for spirv intrinsic and two directx intrinsics to retain signedness information of the operands in CGBuiltin.cpp - add semantic analysis in SemaHLSL.cpp - add lowering of spirv intrinsic to spirv backend in SPIRVInstructionSelector.cpp - add lowering of directx intrinsics to WaveActiveOp dxil op in DXIL.td - add test cases to illustrate passespendent pr merges. ``` Resolves #99170	2025-01-27 23:26:56 -08:00
Djordje Todorovic	0cb7636a46	[RISCV] Add MIPS extensions (#121394 ) Adding two extensions for MIPS p8700 CPU: 1. cmove (conditional move) 2. lsp (load/store pair) The official product page here: https://mips.com/products/hardware/p8700	2025-01-28 08:04:09 +01:00
Craig Topper	d4af658323	[RISCV] Support multiple memory operands in expandRV32ZdinxStore. TailMerge can create stores with multiple memory operands. We need to split all of them instead of assuming there is only one.	2025-01-27 22:10:51 -08:00
quic_hchandel	2d0688797c	[RISCV] Renaming muladdi to muliadd as per v0.5 spec. (#124237 ) muliadd is more relevant to the operation performed, i.e. multiply by immediate. The latest spec can be found at: https://github.com/quic/riscv-unified-db/releases/latest	2025-01-27 20:40:45 -08:00
Shilei Tian	6e4105574e	[NFC][AMDGPU] Improve code introduced in #124607 (#124672 )	2025-01-27 22:57:16 -05:00
Momchil Velikov	f75860f895	[AArch64] Implement NEON FP8 intrinsics for fused multiply-add (#123615 ) This patch adds the following intrinsics: * Fused multiply-add non-indexed float16x8_t vmlalbq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float16x8_t vmlaltq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlallbbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlallbtq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlalltbq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) float32x4_t vmlallttq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t) * Floating-point multiply-add long to half-precision (vector, by element) float16x8_t vmlalbq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vmlalbq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vmlaltq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vmlaltq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) * Floating-point multiply-add long-long to single-precision (vector, by element) float32x4_t vmlallbbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallbbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallbtq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallbtq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)	2025-01-28 00:38:44 +00:00
Shilei Tian	3b2b7ec07d	[AMDGPU] Handle invariant marks in `AMDGPUPromoteAllocaPass` (#124607 ) Fixes SWDEV-509327.	2025-01-27 17:30:50 -05:00
David Green	5a81a559d6	[GISel] Explicitly disable BF16 tablegen patterns. (#124113 ) We currently have an issue where bf16 patterns can be used to match fp16 types, as GISel does not know about the difference between the two. This patch explicitly disables them to make sure that they are never used. The opposite can also happen, where fp16 patterns are used for operators that should be bf16. So this also changes any operations with bf16 types to now cause a fallback to SDAG. The pass setup for GISel has been slightly adjusted to make sure that a verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass, which otherwise can cause verifier issues when falling back.	2025-01-27 22:21:12 +00:00
Momchil Velikov	804b81d39f	[AArch64] Add FP8 Neon intrinsics for dot-product (#123613 ) This patch adds the following intrinsics: float16x4_t vdot_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t vm, fpm_t fpm) float16x8_t vdotq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) float16x4_t vdot_lane_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x4_t vdot_laneq_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vdotq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float16x8_t vdotq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x2_t vdot_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t vm, fpm_t fpm) float32x4_t vdotq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) float32x2_t vdot_lane_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x2_t vdot_laneq_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vdotq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) float32x4_t vdotq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm)	2025-01-27 21:14:16 +00:00
Craig Topper	aa34a6ab29	[RISCV] Add register allocation hints for lui/auipc+addi fusion. (#123860 ) Spotted the auipc case while looking at code for P550. I'm not sure this is the right long term fix. We're still missing rematerialization opportunities for these pairs so a pseudo might be better. That would interfere with folding auipc+add into load/store addressing though. Fixes #76779.	2025-01-27 11:16:22 -08:00
Heejin Ahn	539b2e0654	[WebAssembly] Fix catch block type in wasm64 (#124381 ) `try_table`'s `catch` or `catch_ref`'s target block's return type should be `i64` and `(i64, exnref)` in case of wasm64.	2025-01-27 11:01:48 -08:00
Jeffrey Byrnes	e77d428e46	[AMDGPU] Do not remat instructions with PhysReg uses (#124366 ) This blocks rematerialization during scheduling if the instruction has a non accepted PhysReg use. Currently, there aren't any checks like this in place, and we may create invalid code: https://godbolt.org/z/xjPjdcorf	2025-01-27 10:50:06 -08:00
Brox Chen	d1139b32d2	[AMDGPU][True16][CodeGen] true16 codegen pats for v_mad_u16 (#124000 ) true16 codegen pats for v_mad_u16 (mul+add)	2025-01-27 13:47:17 -05:00
Momchil Velikov	5d6d982df6	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (11/11) (#116837 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `SXTB`, `UXTB`, `SXTH`, `UXTH`, `SXTW`, and `UXTW` instructions.	2025-01-27 18:12:00 +00:00
Momchil Velikov	99bd2e3f12	[AArch64] Add Neon FP8 conversion intrinsics (#123612 ) The patch adds the following intrinsics: bfloat16x8_t vcvt1_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) bfloat16x8_t vcvt1_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) bfloat16x8_t vcvt2_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) bfloat16x8_t vcvt2_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) bfloat16x8_t vcvt1_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) bfloat16x8_t vcvt2_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt1_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) float16x8_t vcvt1_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt2_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) float16x8_t vcvt2_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt1_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) float16x8_t vcvt2_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) mfloat8x8_t vcvt_mf8_f32_fpm(float32x4_t vn, float32x4_t vm, fpm_t fpm) mfloat8x16_t vcvt_high_mf8_f32_fpm(mfloat8x8_t vd, float32x4_t vn, float32x4_t vm, fpm_t fpm) mfloat8x8_t vcvt_mf8_f16_fpm(float16x4_t vn, float16x4_t vm, fpm_t fpm) mfloat8x16_t vcvtq_mf8_f16_fpm(float16x8_t vn, float16x8_t vm, fpm_t fpm) Co-Authored-By: Caroline Concatto <caroline.concatto@arm.com>	2025-01-27 17:32:47 +00:00
Momchil Velikov	4e231014c1	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (10/11) (#116836 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `RBIT`, `REVB`, `REVH`, `REVW`, and `REVD` instructions.	2025-01-27 16:45:40 +00:00
Luke Lau	cb6f021af2	[RISCV][VLOPT] Remove unnecessary passthru restriction (#124549 ) We currently check for passthrus in two places, on the instruction to reduce in isCandidate, and on the users in checkUsers. We cannot reduce the VL if an instruction has a user that's a passthru, because the user will read elements past VL in the tail. However it's fine to reduce an instruction if it itself contains a non-undef passthru. Since the VL can only be reduced, not increased, the previous tail will always remain the same.	2025-01-27 23:54:32 +08:00
Momchil Velikov	f95f10c7e6	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (9/11) (#116835 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `URECPE`, `URSQRTE`, `SQABS` and `SQNEG` instructions.	2025-01-27 15:50:53 +00:00
Simon Pilgrim	86705eb624	[X86] huge-stack-offset.ll - add gnux32 test coverage This should match x86 for the basic implementation, but it's useful to check it actually runs correctly.	2025-01-27 14:10:16 +00:00
David Green	ef54e0bbfb	[AArch64] Avoid generating LDAPUR on certain cores (#124274 ) On the CPUs listed below, we want to avoid LDAPUR for performance reasons. Add a tuning feature to disable them when using: -mcpu=neoverse-v2 -mcpu=neoverse-v3 -mcpu=cortex-x3 -mcpu=cortex-x4 -mcpu=cortex-x925	2025-01-27 13:12:11 +00:00
Momchil Velikov	d8ad1eef8f	[AArch64] Generate zeroing forms of certain SVE2.2 instructions (7/11) (#116833 ) SVE2.2 introduces instructions with predicated forms with zeroing of the inactive lanes. This allows in some cases to save a `movprfx` or a `mov` instruction when emitting code for `_x` or `_z` variants of intrinsics. This patch adds support for emitting the zeroing forms of certain `FLOGB` instructions.	2025-01-27 12:53:38 +00:00
Durgadoss R	3b5e9eed2f	[NVPTX] Add float to tf32 conversion intrinsics (#124316 ) This patch adds the set of f32 -> tf32 cvt intrinsics introduced in sm100 with ptx8.6. This builds on top of the recent PR #121507. Tests are verified with a 12.8 ptxas executable. PTX ISA link: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-01-27 15:52:43 +05:30
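
The combiner loop described in the reapplied "[AArch64] Combine and and lsl into ubfiz" entry depends on a left shift distributing over a bitwise AND, so each form can be rewritten back into the other. Below is a minimal Python sketch of that identity on 64-bit values. The function names and `MASK64` are illustrative assumptions, and this is not the LLVM implementation.

```python
MASK64 = (1 << 64) - 1  # model i64 wraparound with Python integers


def to_u64(value):
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & MASK64


def shl_of_and(x, c1, c2):
    """(shl (and x, C1), C2): mask first, then shift."""
    return ((x & c1) << c2) & MASK64


def and_of_shl(x, c1, c2):
    """(and (shl x, C2), (shl C1, C2)): shift both operands, then mask."""
    shifted_x = (x << c2) & MASK64
    shifted_c1 = (c1 << c2) & MASK64
    return shifted_x & shifted_c1


# The constant that appears in the commit message's DAG dump.
C1 = to_u64(-2401053089408754003)

for x in (0, 1, 0xDEADBEEF, MASK64, 0x8000000000000001):
    # Both forms compute the same value, so neither rewrite is "wrong";
    # the problem in the log is only that they can undo each other forever.
    assert shl_of_and(x, C1, 1) == and_of_shl(x, C1, 1)
```

Because the two forms are equal for every input, a combine in one direction and a simplification in the other can alternate indefinitely unless one side is blocked, which is why the commit restricts the rewrite to cases where `(shl C1, C2)` constant-folds.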


57176 Commits
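
The "[BPF] Remove 'may_goto 0' instructions" entry says that a run such as `may_goto 2; may_goto 1; may_goto 0` can be removed entirely. The following Python sketch shows why, using a toy instruction list. It assumes a `may_goto N` targets the instruction `N` slots past the next one and that all offsets are non-negative. The tuple encoding and the function name are hypothetical, and this is not the code from the LLVM patch.

```python
def remove_noop_may_gotos(insns):
    """Repeatedly delete ("may_goto", 0) entries from a toy instruction list.

    Each instruction is a tuple: ("may_goto", offset) or ("op", name).
    Assumption: offsets are non-negative and relative to the next instruction.
    """
    insns = list(insns)
    changed = True
    while changed:
        changed = False
        for i, insn in enumerate(insns):
            if insn != ("may_goto", 0):
                continue
            # Earlier branches that jump past slot i lose one slot of distance.
            for j in range(i):
                kind, arg = insns[j]
                if kind == "may_goto" and j + 1 + arg > i:
                    insns[j] = (kind, arg - 1)
            del insns[i]
            changed = True
            break  # restart the scan: a neighbour may now have offset 0
    return insns


program = [
    ("op", "a"),
    ("may_goto", 2),
    ("may_goto", 1),
    ("may_goto", 0),
    ("op", "b"),
]
cleaned = remove_noop_may_gotos(program)
```

Deleting the trailing `may_goto 0` shortens the two earlier jumps by one slot each, which turns `may_goto 1` into a new `may_goto 0`, and so on until only the two ordinary operations remain.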