llvm-project

Author	SHA1	Message	Date
Jun Wang	86842e1f72	[AMDGPU] New clang option for emitting a waitcnt instruction after each memory instruction (#79236 ) This patch introduces a new command-line option for clang, namely, amdgpu-precise-mem-op (or precise-memory in the backend). When this option is specified, a waitcnt instruction is generated after each memory load/store instruction. The counter values are always 0, but which counters are involved depends on the memory instruction. --------- Co-authored-by: Jun Wang <jun.wang7@amd.com>	2024-04-10 10:47:04 -07:00
Jay Foad	9c58f3a234	[AMDGPU] Fix implicit $vcc operands after parsing MIR (#87781 ) MIParser checks that implicit operands match the instruction definition, so they have to be $vcc even in wave32 mode. Use the mirFileLoaded hook to fix them after MIParser's checks, converting them to $vcc_lo which is what the rest of CodeGen expects. This is all just extending the fixImplicitOperands hack which was introduced with GFX10, but at least it makes it possible to write a MIR test which creates the same instructions that normal CodeGen would generate.	2024-04-09 09:10:45 +01:00
Matt Arsenault	acb2a47576	AMDGPU: Regenerate test checks	2024-04-08 08:17:09 -04:00
David Green	ac321cbb03	[AArch64][GlobalISel] Legalize Insert vector element (#81453 ) This attempts to standardize and extend some of the insert vector element lowering. Most notably: - More types are handled by splitting illegal vectors. - The index type for G_INSERT_VECTOR_ELT is canonicalized to TLI.getVectorIdxTy(), similar to extract_vector_element. - Some of the existing patterns now have the index type specified to make sure they can apply to GISel too. - The C++ selection code has been removed, relying on tablegen patterns. - G_INSERT_VECTOR_ELT with small GPR input elements is pre-selected to use an i32 type, allowing the existing patterns to apply. - Variable index inserts are lowered in post-legalizer lowering, expanding into a stack store and reload.	2024-04-08 08:44:13 +01:00
Bevin Hansson	110c22fe12	[ExpandLargeFpConvert] Support bfloat. (#87619 ) The conversion expansions did not properly handle bfloat types. I'm not certain that these expansions are completely correct; I don't have any experience with AMDGPU or the ability to run anything to test it. Note that it doesn't seem like AMDGPU with GlobalISel can handle fptrunc of float to bfloat, which is needed for itofp. I've omitted the GISEL run for the bfloat case. This fixes #85379.	2024-04-08 09:07:55 +02:00
Piotr Sobczak	5b59ae423a	[DAG] Preserve NUW when reassociating (#87621 ) Similarly to the generic case below, preserve the NUW flag when reassociating adds with constants.	2024-04-04 16:47:25 +02:00
Jay Foad	3cf539fb04	[AMDGPU] Combine or remove redundant waitcnts at the end of each MBB (#87539 ) Call generateWaitcnt unconditionally at the end of SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to generate a new waitcnt instruction it has the effect of combining or removing redundant waitcnts that were already present. Tests show various small improvements in waitcnt placement.	2024-04-04 10:14:16 +01:00
Bevin Hansson	cd6434f9ec	[ExpandLargeDivRem] Scalarize vector types. (#86959 ) expand-large-divrem cannot handle vector types. If overly large vector element types survive into isel, they will likely be scalarized there, but since isel cannot handle scalar integer types of that size, it will assert. Handle vector types in expand-large-divrem by scalarizing them and then expanding the scalar type operation. For large vectors, this results in a massive code expansion, but it's better than asserting.	2024-04-02 16:37:36 +02:00
Sameer Sahasrabuddhe	421557974a	[AMDGPU] Use glue for convergence tokens at call-like operations (#86766 ) The earlier implementation on AMDGPU used explicit token operands at SI_CALL and SI_CALL_ISEL. This is now replaced with CONVERGENCECTRL_GLUE operands, with the following effects: - The treatment of tokens at call-like operations is now consistent with the treatment at intrinsics. - Support for tail calls using implicit tokens at SI_TCRETURN "just works". - The extra parameter at call-like instructions is eliminated, thus restoring those instructions and their handling to the original state. The new glue node is placed after the existing glue node for the outgoing call parameters, which seems to not interfere with selection of the call-like nodes.	2024-04-01 10:51:13 +05:30
Vitaly Buka	20f56e1f8e	[CodeGen] Add default lowering for llvm.allow.{runtime,ubsan}.check() (#86049 ) RFC: https://discourse.llvm.org/t/rfc-add-llvm-experimental-hot-intrinsic-or-llvm-hot/77641	2024-03-31 22:19:33 -07:00
Ruiling, Song	216b5e9666	[AMDGPU] Expose RTZ version of f16 interpolation for gfx11+ (#86614 )	2024-04-01 09:48:37 +08:00
Austin Kerbow	b5b34dbb27	[AMDGPU] Use directive for kernarg preload header padding (#86004 )	2024-03-31 11:03:03 -07:00
Austin Kerbow	0234d90d81	[AMDGPU] Extend MFMA padding option to gfx90a+ (#86768 ) It was shown experimentally that this may have some benefit on newer HW.	2024-03-31 10:46:05 -07:00
Jay Foad	95258419f6	[AMDGPU] Use AMDGPU::isIntrinsicAlwaysUniform in isSDNodeAlwaysUniform (#87085 ) This is mostly just a simplification, but tests show a slight codegen improvement in code using the deprecated amdgcn.icmp/fcmp intrinsics.	2024-03-30 08:01:18 +00:00
Shilei Tian	3a106e5b2c	[GlobalISel] Fold G_ICMP if possible (#86357 ) This patch tries to fold `G_ICMP` if possible.	2024-03-29 15:59:50 -04:00
Shilei Tian	661bb9daae	[GlobalISel] Handle div-by-pow2 (#83155 ) This patch adds similar handling of div-by-pow2 as in `SelectionDAG`.	2024-03-29 12:41:47 -04:00
Craig Topper	23d45e55ed	[MCP] Remove dead copies from basic blocks with successors. (#86973 ) Previously we wouldn't remove dead copies from basic blocks with successors. The comment said we didn't want to trust the live-in lists. The comment is very old so I'm not sure if that's still a concern today. This patch checks the live-in lists and removes copies from MaybeDeadCopies if they are referenced by any live-ins in any successors. We only do this if the tracksLiveness property is set. If that property is not set, we retain the old behavior.	2024-03-28 14:43:49 -07:00
Shilei Tian	0a43ca731b	[AMDGPU] Fix missing `IsExact` flag when expanding vector binary operator (#86712 )	2024-03-27 17:40:58 -04:00
Kevin P. Neal	f5296df97c	[FPEnv][AMDGPU] Correct AMDGPUSimplifyLibCalls handling of strictfp attribute. (#86705 ) The AMDGPUSimplifyLibCalls pass was lowering function calls with the strictfp attribute to sequences that included function calls incorrectly lacking the attribute. This patch corrects that. The pass now also emits the correct constrained fp call instead of normal FP instructions when in a function with the strictfp attribute. Replacing non-constrained calls with constrained calls when required is still on the IRBuilder's TODO list.	2024-03-27 10:20:00 -04:00
Matt Arsenault	ef316da4a2	AMDGPU: Fix dead check prefixes in test	2024-03-27 14:42:47 +03:00
Thomas Symalla	256343a0e9	Revert "Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226 " (#86273 ) Reverts llvm/llvm-project#81394 This reverts commit 3ac243bc0d7922d083af2cf025247b5698556062. It is not handling RSrc registers s0-s3 correctly. This leads to a broken test, where it expects s0-s3 as function argument and uses it as RSrc register as well. We need to re-visit the patch, but apparently we only want to have s0-s3 as argument registers if we don't need them as RSrc registers.	2024-03-26 11:01:08 +01:00
David Green	4d315ff382	[GlobalISel] Add CTLZ known bits. (#86436 ) Replicated from SDAG.	2024-03-26 09:11:35 +00:00
Bevin Hansson	14c30189fb	[ExpandLargeFpConvert] Fix incorrect values in fp-to-int conversion. (#86514 ) The IR for a double-to-i129 conversion looks like this in one of the blocks in compiler-rt: %cmp5.i = icmp ult i16 %3, -129, !dbg !24 But in ExpandLargeFpConvert, it looks like: %13 = icmp ult i129 %12, 4294967167, !dbg !19 ExpandLargeFpConvert is wrong; the value should have been signed before negating, but instead we get a very large unsigned value. Another value in the same pass also has this issue.	2024-03-26 10:08:22 +01:00
Changpeng Fang	350bda4419	AMDGPU: Rename intrinsics and remove f16/bf16 versions for load transpose (#86313 ) Rename the intrinsics to close to the instruction mnemonic names: Use global_load_tr_b64 and global_load_tr_b128 instead of global_load_tr. This patch also removes f16/bf16 versions of builtins/intrinsics. To simplify the design, we should avoid enumerating all possible types in implementing builtins. We can always use bitcast.	2024-03-25 16:55:22 -07:00
Jeffrey Byrnes	b761137049	[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case (#85860 ) `ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs` which returns 512 for gfx90a. This is subsequently limited by `AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for gfx90a. The ISA states we can have a total of 512 registers, but a maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4). Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)` calculates the maximum number of combined VGPR + AGPR. But, it is currently used as the limit for accvgpr and as the limit for archvgpr. This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits. It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way.	2024-03-25 13:11:58 -07:00
David Stuttard	06cfbe3cfd	[AMDGPU] Add support for idxen and bothen buffer load/store merging in SILoadStoreOptimizer (#86285 ) Added more buffer instruction merging support	2024-03-25 14:44:22 +00:00
David Stuttard	75e528fdd9	[AMDGPU] Extend zero initialization of return values for TFE (#85759 ) buffer_load instructions that use TFE also need to zero initialize return values similar to how the image instructions currently work. Add support for this with standard zero init of all results + zero init of just TFE flag when enable-prt-strict-null subtarget feature is disabled.	2024-03-25 09:01:46 +00:00
Pierre van Houtryve	babbdad15b	[AMDGPU] Handle non-register operands for S_SUB/ADD_U64_PSEUDO (#86104 ) This pseudo uses SSrc_b64 so it allows both an immediate or a register, but the lowering crashed on immediate operands.	2024-03-25 09:23:40 +01:00
Evgenii Kudriashov	d365a45cb3	[GlobalISel] Introduce G_TRAP, G_DEBUGTRAP, G_UBSANTRAP (#84941 ) Here we introduce three new GMIR instructions to cover a set of trap intrinsics. The idea behind it is that generic intrinsics shouldn't be used with the G_INTRINSIC opcode. These new instructions can match perfectly with existing trap ISD nodes. This allows X86, AArch64, RISCV and Mips to reuse SelectionDAG patterns for selection and avoid manual selection. However, AMDGPU is an exception: it selects traps during legalization regardless of whether SelectionDAG or GlobalISel is used. Since there are not many places where traps are used, this change attempts to clean up all the usages of G_INTRINSIC with trap intrinsics. So, there is no stage when both G_TRAP and G_INTRINSIC_W_SIDE_EFFECTS(@llvm.trap) are allowed.	2024-03-23 13:12:44 +01:00
Pravin Jagtap	e1a8120a63	[AMDGPU] Support double type in atomic optimizer. (#84307 ) Presently the atomic optimizer supports only 32-bit operations. The plan is to extend the atomic optimizer to 64-bit operations for compute and graphics. This patch extends support for the double type for `uniform values` only. Going forward, support will be extended to divergent values. Adding support for divergent values requires extending/legalizing readfirstlane, readlane, writelane, etc. ops for 64-bit operations to avoid the `bitcast` noise that we have currently. --------- Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-03-22 09:25:06 +05:30
paperchalice	a2dfc9ac7d	[NewPM][AMDGPU] Add AMDGPUPassRegistry.def (#86095 ) Move the pass registry to a separate file, prepare for porting dag-isel.	2024-03-22 08:49:29 +08:00
Jonas Paulsson	7564566779	Reapply "Move assertion for AdjustsStack from PEI to MachineVerifier (#85698 )" - The check is now actually done in both PEI and the MachineVerifier. - More .mir tests trivially updated with "adjustsStack: true" as needed.	2024-03-21 20:24:57 -04:00
SahilPatidar	3ac243bc0d	Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226 (#81394 ) Resolve #78226	2024-03-21 16:52:08 +05:30
Pierre van Houtryve	95a834a16c	(Reland) [AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#85626 ) Reland of #75333	2024-03-21 11:44:47 +01:00
Pierre van Houtryve	ccb3a8feaa	[AMDGPU][LowerModuleLDS] Refactor partially lowered module detection (#85793 ) Refactor the logic that checks if a module contains mixed absolute/non-lowered LDS GVs. The check now happens later, when the "worklists" are formed. This is because in some cases (OpenMP) we can have non-lowered GVs in a lowered module, and this is normal because those GVs are just unused and removed from the list at some point before the end of `getUsesOfLDSByFunction`. Doing the check later ensures that if a mixed module is spotted, then it's a _real_ mixed module that needs rejection, not a module containing an intentionally ignored GV.	2024-03-21 11:28:35 +01:00
Matt Arsenault	b6b703b2df	AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948 ) SIMachineFunctionInfo has a scan of the function body for inline asm which may use AGPRs, or callees in SIMachineFunctionInfo. Move this into the attributor, so it actually works interprocedurally. Could probably avoid most of the test churn if this bothered to avoid adding this on subtargets without AGPRs. We should also probably try to delete the MIR scan in usesAGPRs but it seems to be trickier to eliminate.	2024-03-21 14:24:06 +05:30
Thorsten Schütt	deefe3fbc9	[GlobalIsel] Post-review combine ADDO (#85961 ) https://github.com/llvm/llvm-project/pull/82927	2024-03-21 03:56:40 +01:00
Jonas Paulsson	9ebd329ad8	Revert "Move assertion for AdjustsStack from PEI to MachineVerifier. (#85698 )" This reverts commit 05bde30585710a51592eee0a6cf6df8184d09c92. Reverting due to verifier complaints with expensive checks on build-bot.	2024-03-20 11:48:30 -04:00
Jonas Paulsson	05bde30585	Move assertion for AdjustsStack from PEI to MachineVerifier. (#85698 ) Have the verifier report a missing AdjustsStack flag rather than waiting until PEI asserts.	2024-03-20 10:29:12 -04:00
Pravin Jagtap	e52a687871	[AMDGPU][NFC] Test clean up (#85922 ) Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-03-20 17:29:42 +05:30
Pravin Jagtap	070d1e8321	[AMDGPU] Add test for fpext & fptrunc with bf16. (#85909 ) Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-03-20 14:45:38 +05:30
Peter Rong	4a026b5092	[AMDGCN] Use ZExt when handling indices in insertelement (#85718 ) When i1 true is used as an index, SExt extends it to i32 -1. This would cause BitVector to overflow. The language reference specifies that the index shall be treated as an unsigned number; this patch fixes that. (https://llvm.org/docs/LangRef.html#insertelement-instruction) This patch fixes #85717 --------- Signed-off-by: Peter Rong <PeterRong96@gmail.com>	2024-03-19 21:44:08 -07:00
Changpeng Fang	ab76052fa9	AMDGPU: Treat SWMMAC the same as MFMA and other WMMA for sched_barrier (#85721 )	2024-03-19 09:58:09 -07:00
Pravin Jagtap	08701e35ed	[AMDGPU][NFC] Test clean up. (#85775 ) Added common check for DPP and Iterative strategies for uniform value case since optimization applied is same. Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-03-19 18:00:34 +05:30
Pierre van Houtryve	953c13b5c9	[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735 ) Update PromoteAllocaToVector so it considers the whole function before promoting allocas. Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca. Passed internal performance testing.	2024-03-19 11:49:22 +01:00
Jonas Paulsson	09bc6abba6	[MachineFrameInfo] Refactoring around computeMaxCallFrameSize() (NFC) (#78001 ) - Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code. - Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().	2024-03-18 10:37:59 -04:00
Yingwei Zheng	38a44bdc93	[CodeGenPrepare] Reverse the canonicalization of isInf/isNanOrInf (#81572 ) In commit `2b582440c1`, we canonicalize the isInf/isNanOrInf idiom into fabs+fcmp for better analysis/codegen (See also the discussion in https://github.com/llvm/llvm-project/pull/76338). This patch reverses the fabs+fcmp to `is.fpclass`. If the `is.fpclass` is not supported by the target, it will be expanded by TLI. Fixes the regression introduced by `2b582440c1` and https://github.com/llvm/llvm-project/pull/80414#issuecomment-1936374206.	2024-03-18 18:27:45 +08:00
pvanhout	3493438605	Revert "[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333 )" This reverts commit 9b98692eedb78aa106539c36ba02944f32cae1ff.	2024-03-18 11:18:57 +01:00
Pierre van Houtryve	9b98692eed	[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333 ) This change allows us to use `--lto-partitions` in some cases (not at all guaranteed it works perfectly), as LDS is lowered before the module is split for parallel codegen. We must run LowerLDS before splitting modules as it needs to see all callers of functions with LDS to properly lower them.	2024-03-18 09:09:43 +01:00
Sameer Sahasrabuddhe	ec34699f75	[GlobalISel] convergence control tokens and intrinsics (#67006 ) [GlobalISel] Implement convergence control tokens and intrinsics in GMIR In the IR translator, convert the LLVM token type to LLT::token(), which is an alias for the s0 type. These show up as implicit uses on convergent operations. Differential Revision: https://reviews.llvm.org/D158147	2024-03-18 10:34:11 +05:30

7296 Commits