llvm-project

Author	SHA1	Message	Date
Craig Topper	dd3edc8365	[CodeGen] Add Register::stackSlotIndex(). Replace uses of Register::stackSlot2Index. NFC (#125028 )	2025-01-29 23:02:07 -08:00
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Joel E. Denny	18f8106f31	[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944 ) This patch implements an LLVM IR pass, named kernel-info, that reports various statistics for codes compiled for GPUs. The ultimate goal of these statistics is to help identify bad code patterns and ways to mitigate them. The pass operates at the LLVM IR level so that it can, in theory, support any LLVM-based compiler for programming languages supporting GPUs. It has been tested so far with LLVM IR generated by Clang for OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs. By default, the pass runs at the end of LTO, and options like ``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang` command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include summary statistics (e.g., total size of static allocas) and individual occurrences (e.g., source location of each alloca). Examples of its output appear in tests in `llvm/test/Analysis/KernelInfo`.	2025-01-29 12:40:19 -05:00
Konstantina Mitropoulou	9adc99bcc5	[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. (#124028 ) - [NFC] Use GCNPat instead of Pat. - [AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. --------- Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>	2025-01-29 09:00:40 -08:00
Juan Manuel Martinez Caamaño	0c63ec5347	[NFC][SIWholeQuadMode] Remove redundant arguments (#124930 )	2025-01-29 16:33:15 +01:00
Juan Manuel Martinez Caamaño	2e43f39223	[NFC][SIWholeQuadMode] Perform less lookups (#124927 )	2025-01-29 15:36:54 +01:00
Acim Maravic	3a29dfe37c	[LLVM][AMDGPU] Add Intrinsic and Builtin for ds_bpermute_fi_b32 (#124616 )	2025-01-29 14:04:10 +01:00
Ivan Kosarev	983562d8c5	[AMDGPU][NFC] Simplify t16/fake16 TableGen definitions. (#122693 ) Infer mnemonics from the names of the records.	2025-01-29 12:46:05 +00:00
Akshat Oke	71edfd6230	[AMDGPU][NewPM] Sketch out a AMDGPUPassRegistry skeleton (#124785 ) Add a dummy pass skeleton list to help track the progress in porting passes to NPM.	2025-01-29 13:26:50 +05:30
Daniil Fukalov	68d90cff58	[AMDGPU][GlobalISel] Fix assert on APInt creation. (#124608 ) Since 3494ee95902cef62f767489802e469c58a13ea04 APInt stopped implicitly truncating values, so it asserts on a big signed value converted (implicitly) to an unsigned APInt. The change explicitly marks offset as a signed value.	2025-01-28 15:53:17 +01:00
Akshat Oke	42432ada8e	[AMDGPU][NFC] Sort AMDGPUPassRegistry entries alphabetically (#124544 )	2025-01-28 11:25:56 +05:30
Matt Arsenault	cc97653d53	AMDGPU: Custom lower 32-bit element shuffles (#123711 ) This is so we can try to make use of v_pk_mov_b32 when available. Note this currently has little observable effect. The combiner will undo the common extract of shuffle pattern. The lack of test changes should demonstrate this change is minimally correct. We should probably try to make better use of wider extracts in even aligned cases, but I'm trying to avoid some really ugly regalloc regressions in some MFMA tests. The DAG scheduler ends up doing a worse job if we use vector extracts, resulting in failure to do 3 address conversion of MFMAs.	2025-01-28 11:17:10 +07:00
Shilei Tian	6e4105574e	[NFC][AMDGPU] Improve code introduced in #124607 (#124672 )	2025-01-27 22:57:16 -05:00
Shilei Tian	3b2b7ec07d	[AMDGPU] Handle invariant marks in `AMDGPUPromoteAllocaPass` (#124607 ) Fixes SWDEV-509327.	2025-01-27 17:30:50 -05:00
Brox Chen	5d1c596ab4	[AMDGPU][True16][MC] true16 for minimummaximum/max/min/max3/min3 (#124184 ) true16 support for gfx12 instructions including: v_minimummaximum_f16 v_maximumminimum_f16 v_maximum_f16 v_minimum_f16 v_maximum3_f16 v_minimum3_f16	2025-01-27 16:52:59 -05:00
Jeffrey Byrnes	e77d428e46	[AMDGPU] Do not remat instructions with PhysReg uses (#124366 ) This blocks rematerialization during scheduling if the instruction has a non accepted PhysReg use. Currently, there aren't any checks like this in place, and we may create invalid code: https://godbolt.org/z/xjPjdcorf	2025-01-27 10:50:06 -08:00
Brox Chen	d1139b32d2	[AMDGPU][True16][CodeGen] true16 codegen pats for v_mad_u16 (#124000 ) true16 codegen pats for v_mad_u16 (mul+add)	2025-01-27 13:47:17 -05:00
Jeremy Morse	81d18ad864	[NFC][DebugInfo] Make some block-start-position methods return iterators (#124287 ) As part of the "RemoveDIs" work to eliminate debug intrinsics, we're replacing methods that use Instruction's as positions with iterators. A number of these (such as getFirstNonPHIOrDbg) are sufficiently infrequently used that we can just replace the pointer-returning version with an iterator-returning version, hopefully without much/any disruption. Thus this patch has getFirstNonPHIOrDbg and getFirstNonPHIOrDbgOrLifetime return an iterator, and updates all call-sites. There are no concerns about the iterators returned being converted to Instruction's and losing the debug-info bit: because the methods skip debug intrinsics, the iterator head bit is always false anyway.	2025-01-27 16:27:54 +00:00
Brox Chen	62340ff8d8	[AMDGPU][True16][MC] true16 for v_cmpx_xx_f16 (#123419 ) A bulk commit of true16 support for v_cmpx_xx_f16 instructions including: v_cmpx_f_f16 v_cmpx_le_f16 v_cmpx_gt_f16 v_cmpx_lg_f16 v_cmpx_ge_f16 v_cmpx_o_f16 v_cmpx_u_f16 v_cmpx_nge_f16 v_cmpx_nlg_f16 v_cmpx_ngt_f16 v_cmpx_nle_f16 v_cmpx_neq_f16 v_cmpx_nlt_f16 v_cmpx_t_f16 v_cmpx_eq_f16 is not in this patch and will be added in the following patch	2025-01-27 10:12:20 -05:00
Craig Topper	f46eb14309	[AMDGPU] Replace unsigned with Register in SIMachineScheduler. NFC Some of these may eventually need to be VirtRegOrUnit.	2025-01-26 00:26:00 -08:00
Brox Chen	241e5d8c5c	[AMDGPU][True16][MC] true16 for v_cmpx_eq_f16 (#124038 ) True16 format for v_cmpx_eq_f16. Also cleaned up some stray gfx11 check line in gfx12 dasm test	2025-01-24 18:15:40 -05:00
Brox Chen	ec66c4af09	[AMDGPU][True16][CodeGen] true16 codegen pattern for f16 canonicalize (#122000 ) true16 codegen pattern for f16 canonicalize	2025-01-24 10:44:00 -05:00
Aaditya	11b0401926	[AMDGPU] Restore SP from saved-FP or saved-BP (#124007 ) Currently, the AMDGPU backend bumps the Stack Pointer by fixed size offsets in the prolog of device functions, and restores it by the same amount in the epilog. Prolog: sp += frameSize Epilog: sp -= frameSize If a function has dynamic stack realignment, Prolog: sp += frameSize + max_alignment Epilog: sp -= frameSize + max_alignment These calculations are not optimal in case of dynamic stack realignment, and completely fail in case of dynamic stack readjustment. This patch uses the saved Frame Pointer to restore SP. Prolog: fp = sp sp += frameSize Epilog: sp = fp In case of dynamic stack realignment, SP is restored from the saved Base Pointer. Prolog: fp = sp + (max_alignment - 1) fp = fp & (-max_alignment) bp = sp sp += frameSize + max_alignment Epilog: sp = bp (Note: The presence of BP has been enforced in case of any dynamic stack realignment.) --------- Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-01-24 19:13:40 +05:30
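The prolog/epilog arithmetic described in 11b0401926 can be traced with a small simulation. This is only a sketch under stated assumptions: the function names and the `dynamic_alloc` parameter are hypothetical, the stack grows upward as in the commit message (`sp += frameSize`), and nothing here is taken from the actual AMDGPU frame-lowering code.

```python
def align_up(value, alignment):
    """Round value up to a multiple of alignment (assumed to be a power of two)."""
    return (value + alignment - 1) & -alignment


def run_old_scheme(sp, frame_size, max_alignment, dynamic_alloc=0):
    """Old behaviour: the epilog subtracts the same fixed amount the prolog added."""
    sp += frame_size + max_alignment   # prolog
    sp += dynamic_alloc                # hypothetical dynamic stack adjustment in the body
    sp -= frame_size + max_alignment   # epilog
    return sp


def run_new_scheme(sp, frame_size, max_alignment, dynamic_alloc=0):
    """New behaviour with realignment: SP is restored from the saved base pointer."""
    fp = align_up(sp, max_alignment)   # fp = (sp + (max_alignment - 1)) & -max_alignment
    bp = sp                            # bp remembers the incoming SP
    sp += frame_size + max_alignment   # prolog
    sp += dynamic_alloc                # hypothetical dynamic stack adjustment in the body
    sp = bp                            # epilog
    return sp, fp


# Without a dynamic adjustment both schemes return to the entry SP.
print(run_old_scheme(100, 64, 32))                    # 100
# With a dynamic adjustment only the BP-based restore gets back to 100.
print(run_old_scheme(100, 64, 32, dynamic_alloc=48))  # 148
print(run_new_scheme(100, 64, 32, dynamic_alloc=48))  # (100, 128)
```

The fixed-offset epilog leaves SP off by exactly the dynamic adjustment, which is the failure case the commit message describes.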
Petar Avramovic	4831fa8632	AMDGPU/GlobalISel: RegBankLegalize rules for load (#112882 ) Add IDs for bit width that cover multiple LLTs: B32 B64 etc. "Predicate" wrapper class for bool predicate functions used to write pretty rules. Predicates can be combined using &&, \|\| and !. Lowering for splitting and widening loads. Write rules for loads to not change existing mir tests from old regbankselect.	2025-01-24 12:36:41 +01:00
Petar Avramovic	0ee037b861	AMDGPU/GlobalISel: AMDGPURegBankLegalize (#112864 ) Lower G_ instructions that can't be inst-selected with register bank assignment from AMDGPURegBankSelect based on uniformity analysis. - Lower instruction to perform it on assigned register bank - Put uniform value in vgpr because SALU instruction is not available - Execute divergent instruction in SALU - "waterfall loop" Given LLTs on all operands after legalizer, some register bank assignments require lowering while others do not. Note: cases where all register bank assignments would require lowering are lowered in legalizer. AMDGPURegBankLegalize goals: - Define Rules: when and how to perform lowering - Goal of defining Rules is to provide high level table-like brief overview of how to lower generic instructions based on available target features and uniformity info (uniform vs divergent). - Fast search of Rules, depends on how complicated Rule.Predicate is - For some opcodes there would be too many Rules that are essentially all the same just for different combinations of types and banks. Write custom function that handles all cases. - Rules are made from enum IDs that correspond to each operand. Names of IDs are meant to give brief description what lowering does for each operand or the whole instruction. - AMDGPURegBankLegalizeHelper implements lowering algorithms Since this is the first patch that actually enables -new-reg-bank-select here is the summary of regression tests that were added earlier: - if instruction is uniform always select SALU instruction if available - eliminate back to back vgpr to sgpr to vgpr copies of uniform values - fast rules: small differences for standard and vector instruction - enabling Rule based on target feature - salu_float - how to specify lowering algorithm - vgpr S64 AND to S32 - on G_TRUNC in reg, it is up to user to deal with truncated bits G_TRUNC in reg is treated as no-op. 
- dealing with truncated high bits - ABS S16 to S32 - sgpr S1 phi lowering - new opcodes for vcc-to-scc and scc-to-vcc copies - lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC) - S1 zext and sext lowering to select - uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into SALU instruction - divergent phi with uniform inputs - divergent instruction with temporal divergent use, source instruction is defined as uniform(AMDGPURegBankSelect) - missing temporal divergence lowering - uniform phi, because of undef incoming, is assigned to vgpr. Will be fixed in AMDGPURegBankSelect via another fix in machine uniformity analysis.	2025-01-24 12:12:45 +01:00
Jeremy Morse	8e70273509	[NFC][DebugInfo] Use iterator moveBefore at many call-sites (#123583 ) As part of the "RemoveDIs" project, BasicBlock::iterator now carries a debug-info bit that's needed when getFirstNonPHI and similar feed into instruction insertion positions. Call-sites where that's necessary were updated a year ago; but to ensure some type safety, we'd like to have all calls to moveBefore use iterators. This patch adds a (guaranteed dereferenceable) iterator-taking moveBefore, and changes a bunch of call-sites where it's obviously safe to change to use it by just calling getIterator() on an instruction pointer. A follow-up patch will contain less-obviously-safe changes. We'll eventually deprecate and remove the instruction-pointer insertBefore, but not before adding concise documentation of what considerations are needed (very few).	2025-01-24 10:53:11 +00:00
Petar Avramovic	f8a56df36e	AMDGPU/GlobalISel: AMDGPURegBankSelect (#112863 ) Assign register banks to virtual registers. Does not use generic RegBankSelect. After register bank selection all register operand of G_ instructions have LLT and register banks exclusively. If they had register class, reassign appropriate register bank. Assign register banks using machine uniformity analysis: Sgpr - uniform values and some lane masks Vgpr - divergent, non S1, values Vcc - divergent S1 values(lane masks) AMDGPURegBankSelect does not consider available instructions and, in some cases, G_ instructions with some register bank assignment can't be inst-selected. This is solved in RegBankLegalize. Exceptions when uniformity analysis does not work: S32/S64 lane masks: - need to end up with sgpr register class after instruction selection - In most cases Uniformity analysis declares them as uniform (forced by tablegen) resulting in sgpr S32/S64 reg bank - When Uniformity analysis declares them as divergent (some phis), use intrinsic lane mask analyzer to still assign sgpr register bank temporal divergence copy: - COPY to vgpr with implicit use of $exec inside of the cycle - this copy is declared as uniform by uniformity analysis - make sure that assigned bank is vgpr Note: uniformity analysis does not consider that registers with vgpr def are divergent (you can have uniform value in vgpr). - TODO: implicit use of $exec could be implemented as indicator that instruction is divergent	2025-01-24 11:06:02 +01:00
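The bank-assignment rules listed in f8a56df36e can be condensed into a small decision function. This is a rough sketch, not the pass itself: the function name and its boolean flags are hypothetical stand-ins for what the real pass derives from machine uniformity analysis and the intrinsic lane mask analyzer.

```python
def assign_register_bank(is_uniform, is_s1,
                         is_lane_mask=False,
                         is_temporal_divergence_copy=False):
    """Pick a register bank name following the rules in the commit message.

    All parameters are hypothetical flags used only for illustration.
    """
    # Exception: the temporal divergence COPY is reported as uniform,
    # but its result must still end up in a vgpr.
    if is_temporal_divergence_copy:
        return "vgpr"
    # Exception: S32/S64 lane masks need the sgpr bank even when the
    # uniformity analysis reports them as divergent (some phis).
    if is_lane_mask:
        return "sgpr"
    # Sgpr: uniform values.
    if is_uniform:
        return "sgpr"
    # Vcc: divergent S1 values; Vgpr: divergent non-S1 values.
    return "vcc" if is_s1 else "vgpr"


print(assign_register_bank(is_uniform=True, is_s1=False))   # sgpr
print(assign_register_bank(is_uniform=False, is_s1=True))   # vcc
print(assign_register_bank(is_uniform=False, is_s1=False))  # vgpr
```

The two exception checks come first because in both cases the uniformity result alone would pick the wrong bank.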
Frederik Harwath	bfd9bc2745	[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#124131 ) This PR reapplies the changes from PR #123942 which had to be reverted because of a test failure. The test has been adjusted.	2025-01-24 09:12:32 +01:00
Chaitanya	3c79a04cc2	[AMDGPU] Add amdgpu-sw-lower-lds pass to NPM codegen addIRPasses. (#124102 ) This PR adds amdgpu-sw-lower-lds pass to AMDGPUCodeGenPassBuilder::addIRPasses()	2025-01-24 11:15:30 +05:30
Acim Maravic	7ddeea3598	[LLVM][AMDGPU] MC support for ds_bpermute_fi_b32 (#124108 ) Added assembler/disassembler support for ds_bpermute_fi_b32 instruction, as well as tests.	2025-01-23 17:55:00 +01:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. 
Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
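The worked example in 6206f5444f can be checked by brute force. The sketch below is not the constant-time algorithm the commit implements; it simply tries every workgroup size in the range. It assumes no LDS usage and no other per-CU limit on the number of groups, and it rounds waves per EU up, which is an assumption (the example's bounds come out the same with either rounding). All function and parameter names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def occupancy_for_group_size(group_size, waves_per_eu=10, eus_per_cu=4,
                             wave_size=64):
    """Waves per EU achieved when the CU is filled with groups of this size."""
    waves_per_group = ceil_div(group_size, wave_size)
    max_waves_per_cu = waves_per_eu * eus_per_cu
    groups_per_cu = max_waves_per_cu // waves_per_group  # whole groups only
    waves_on_cu = groups_per_cu * waves_per_group
    return ceil_div(waves_on_cu, eus_per_cu)             # rounding up is an assumption


def occupancy_range(min_size, max_size, **target):
    """Brute-force (min, max) occupancy over every size in [min_size, max_size]."""
    occupancies = [occupancy_for_group_size(size, **target)
                   for size in range(min_size, max_size + 1)]
    return min(occupancies), max(occupancies)


print(occupancy_for_group_size(513))   # 9
print(occupancy_for_group_size(640))   # 10
print(occupancy_for_group_size(1024))  # 8
print(occupancy_for_group_size(896))   # 7
print(occupancy_range(513, 1024))      # (7, 10)
```

The range bounds come from sizes inside the interval (640 and 896), not from its endpoints, which matches the commit's point that the endpoints alone would give [8,9].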
Nico Weber	99d450e9f5	Revert "[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942 )" This reverts commit 6fdaaafd89d7cbc15dafe3ebf1aa3235d148aaab. Breaks check-llvm, see https://github.com/llvm/llvm-project/pull/123942#issuecomment-2609861953	2025-01-23 09:19:42 -05:00
Matt Arsenault	e28e93550a	AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 (#123684 ) For VALU shuffles, this saves an instruction in some case.	2025-01-23 20:58:02 +07:00
Kareem Ergawy	ff55c9bc63	[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering (#124089 ) Fixes #123800 Extends LDS lowering by allowing it to discover transitive indirect/escaping references to LDS GVs. For example, given the following input: ```llvm @lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8 %store_type = type { i32, ptr } @place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8 define amdgpu_kernel void @offloading_kernel() { store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8 call void @call_unknown() ret void } define void @call_unknown() { %1 = alloca ptr, align 8 %2 = call i32 %1() ret void } define void @indirectly_load_lds() { call void @directly_load_lds() ret void } define void @directly_load_lds() { %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8 ret void } ``` With the above input, prior to this patch, LDS lowering failed to lower the reference to `@lds_item_to_indirectly_load` because: 1. it is indirectly called by a function whose address is taken in the kernel. 2. we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls). Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>	2025-01-23 14:53:11 +01:00
Frederik Harwath	6fdaaafd89	[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942 ) This is meant as a short-term workaround for an invalid conversion in this pass that occurs because existing SDWA selections are not correctly taken into account during the conversion. See the draft PR #123221 for an attempt to fix the actual issue. --------- Co-authored-by: Frederik Harwath <fharwath@amd.com>	2025-01-23 14:32:01 +01:00
Brox Chen	18e9d3dbe5	[AMDGPU][True16][MC] true16 for v_cmpx_xx_u/i16 (#123424 ) A bulk commit of true16 support for v_cmpx_xx_i/u16 instructions including: v_cmpx_lt_i16 v_cmpx_eq_i16 v_cmpx_le_i16 v_cmpx_gt_i16 v_cmpx_ne_i16 v_cmpx_ge_i16 v_cmpx_lt_u16 v_cmpx_eq_u16 v_cmpx_le_u16 v_cmpx_gt_u16 v_cmpx_ne_u16 v_cmpx_ge_u16	2025-01-22 15:57:16 -05:00
Brox Chen	1cf0af3d32	[AMDGPU][True16][MC] true16 for v_cmpx_class_f16 (#123251 ) True16 format for v_cmpx_class_f16. Update VOPCX_CLASS t16 and fake16 pseudo.	2025-01-22 15:56:58 -05:00
Craig Topper	9e6494c0fb	[CodeGen] Rename RegisterMaskPair to VRegMaskOrUnit. NFC (#123799 ) This holds a physical register unit or virtual register and mask. While I was here I've used emplace_back and removed an unneeded use of a template.	2025-01-22 09:11:22 -08:00
Matt Arsenault	93d35ad5f5	AMDGPU: Delete FillMFMAShadowMutation (#123861 ) No test changes with this removed and it appears to be obsolete.	2025-01-22 22:41:25 +07:00
Akshat Oke	a343b8e595	[AMDGPU][NewPM] Port SILowerWWMCopies to NPM (#123695 )	2025-01-22 14:54:01 +05:30
Venkata Ramanaiah Nalamothu	f7d8336a2f	[llvm] Pass MachineInstr flags to storeRegToStackSlot/loadRegFromStackSlot (NFC) (#120622 ) This patch is in preparation to enable setting the MachineInstr::MIFlag flags, i.e. FrameSetup/FrameDestroy, on callee saved register spill/reload instructions in prologue/epilogue. This eventually helps in setting the prologue_end and epilogue_begin markers more accurately. The DWARF Spec in "6.4 Call Frame Information" says: The code that allocates space on the call frame stack and performs the save operation is called the subroutine’s prologue, and the code that performs the restore operation and deallocates the frame is called its epilogue. which means the callee saved register spills and reloads are part of prologue (a.k.a frame setup) and epilogue (a.k.a frame destruction), respectively. And, IIUC, LLVM backend uses FrameSetup/FrameDestroy flags to identify instructions that are part of call frame setup and destruction. In the trunk, while most targets consistently set FrameSetup/FrameDestroy on save/restore call frame information (CFI) instructions of callee saved registers, they do not consistently set those flags on the actual callee saved register spill/reload instructions. I believe this patch provides a clean mechanism to set FrameSetup/FrameDestroy flags on the actual callee saved register spill/reload instructions as needed. And, by having default argument of MachineInstr::NoFlags for Flags, this patch is a NFC. With this patch, the targets have to just pass FrameSetup/FrameDestroy flag to the storeRegToStackSlot/loadRegFromStackSlot calls from the target derived spillCalleeSavedRegisters and restoreCalleeSavedRegisters to set those flags on callee saved register spill/reload instructions. 
Also, this patch makes it very easy to set the source line information on callee saved register spill/reload instructions which is needed by the DwarfDebug.cpp implementation to set prologue_end and epilogue_begin markers more accurately. As per DwarfDebug.cpp implementation: prologue_end is the first known non-DBG_VALUE and non-FrameSetup location that marks the beginning of the function body epilogue_begin is the first FrameDestroy location that has been seen in the epilogue basic block With this patch, the targets have to just do the following to set the source line information on callee saved register spill/reload instructions, without hampering the LLVM's efforts to avoid adding source line information on the artificial code generated by the compiler. <Foo>InstrInfo::storeRegToStackSlot() { ... DebugLoc DL = Flags & MachineInstr::FrameSetup ? DebugLoc() : MBB.findDebugLoc(I); ... } <Foo>InstrInfo::loadRegFromStackSlot() { ... DebugLoc DL = Flags & MachineInstr::FrameDestroy ? MBB.findDebugLoc(I) : DebugLoc(); ... } While I understand this patch would break out-of-tree backend builds, I think it is in the right direction. One immediate use case that can benefit from this patch is fixing #120553 becomes simpler.	2025-01-22 13:36:39 +05:30
Kazu Hirata	ceaaa2b9ae	[AMDGPU] Fix warnings This patch fixes: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:2792:14: error: comparison of integers of different signs: 'unsigned int' and 'int' [-Werror,-Wsign-compare] llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:2797:14: error: comparison of integers of different signs: 'unsigned int' and 'int' [-Werror,-Wsign-compare]	2025-01-21 20:24:30 -08:00
Shoreshen	7c58d6363a	[AMDGPU] Add commute for some VOP3 inst (#121326 ) add commute for some VOP3 inst, allow commute for both inline constant operand, adjust tests Fixes #111205	2025-01-22 11:08:26 +07:00
Shoreshen	e8811ad3cc	[AMDGPU] Fix unreachable reg bit width (#122107 ) Add register class bit width for SReg_256_XNULL and SReg_128_XNULL	2025-01-22 10:05:47 +07:00
Brox Chen	e1c1e74a6f	[AMDGPU][True16][MC] true16 for v_cmp_class_f16 (#122984 ) True16 format for v_cmp_class_f16. Update VOPC_CLASS t16 and fake16 pseudo.	2025-01-21 10:07:14 -05:00
Brox Chen	70632f9566	[AMDGPU][True16][MC] true16 for v_cmp_xx_f16 (#122943 ) A bulk commit of true16 support for v_cmp_xx_f16 instructions including: v_cmp_f_f16 v_cmp_eq_f16 v_cmp_le_f16 v_cmp_gt_f16 v_cmp_lg_f16 v_cmp_ge_f16 v_cmp_o_f16 v_cmp_u_f16 v_cmp_nge_f16 v_cmp_nlg_f16 v_cmp_ngt_f16 v_cmp_nle_f16 v_cmp_neq_f16 v_cmp_nlt_f16 v_cmp_t_f16 Added a GFX12 runline for fcmp.f16	2025-01-21 10:06:22 -05:00
Chinmay Deshpande	9ca1323de1	[AMDGPU] Fix crash due to missing check for FLAT instructions that dont use vector registers when computing VALU hazard (#123627 )	2025-01-21 05:50:58 -08:00
Janek van Oirschot	82944595fa	[AMDGPU] Change scope of resource usage info symbols (#114810 ) Change scope of resource usage info MC symbols to align with the function linkage type	2025-01-21 13:10:06 +00:00
Akshat Oke	7acad6893b	[AMDGPU][CodeGen] SILowerWWMCopies: Declare used analyses (#123710 ) This prevents legacy PM from mistakenly removing these analyses if `SILowerWWMCopies` is the last user of them. (it removes dead analyses after its last use)	2025-01-21 15:33:20 +05:30
Akshat Oke	9b6e8df896	[AMDGPU][NewPM] Port SIFixVGPRCopies to NPM (#123592 ) Extends NPM pipeline support till PostRegAlloc passes (greedy is in the works)	2025-01-21 15:27:46 +05:30

10135 Commits