llvm-project

Author	SHA1	Message	Date
Simon Pilgrim	e9caa37e9c	[DAG] Move lshr narrowing from visitANDLike to SimplifyDemandedBits Inspired by some of the cases from D145468 Let SimplifyDemandedBits handle the narrowing of lshr to half-width if we don't require the upper bits, the narrowed shift is profitable and the zext/trunc are free. A future patch will propose the equivalent shl narrowing combine. Differential Revision: https://reviews.llvm.org/D146121	2023-07-17 15:50:09 +01:00
Jay Foad	92542f2a40	[AMDGPU] Add targets gfx1150 and gfx1151 This is the target definition only. Currently they are treated the same as GFX 11.0.x. Differential Revision: https://reviews.llvm.org/D155429	2023-07-17 13:06:12 +01:00
Jay Foad	a2453c6130	[AMDGPU] Add test case for zext of f16 to i32 Preserve the test case from this abandoned review: D51925 [AMDGPU] Fix issue for zext of f16 to i32	2023-07-17 12:55:29 +01:00
Jay Foad	a1a9c53ae7	[GlobalISel] Fix infinite loop in reassociation combine Don't reassociate (C1+C2)+Y -> C1+(C2+Y). Fixes https://github.com/llvm/llvm-project/issues/63849 Differential Revision: https://reviews.llvm.org/D155284	2023-07-16 14:15:24 +01:00
Jon Chesterfield	6043d4dfec	[amdgpu] Accept an optional max to amdgpu-lds-size attribute for use in PromoteAlloca	2023-07-15 21:37:21 +01:00
Matt Arsenault	ef4a2b6096	AMDGPU: Expand testing of AMDGPUCodeGenPrepare fdiv handling - Switch to generated checks - Use a different run line per denormal mode to reduce test duplication - Add test coverage for rsqrt cases - Add test coverage for repeated arcp denominator - Fix the optnone test	2023-07-14 18:57:40 -04:00
Konstantina Mitropoulou	21ca892f69	[NFC][AMDGPU] Add automated tests in or.ll Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D155265	2023-07-14 12:39:14 -07:00
pvanhout	e5296c52e5	[AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHis The previous heuristic rejected a PHI if one of its user was an unbreakable PHI, no matter what the other users were. This worked well in most cases, but there's one case in rocRAND where it doesn't work. In that case, a PHI node has 2 PHI users where one is breakable but not the other. When that PHI node isn't broken performance falls by 35%. Relaxing the restriction to "require that half of the PHI node users are breakable" fixes the issue, and seems like a sensible change. Solves SWDEV-409648, SWDEV-398393 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155184	2023-07-14 09:02:51 +02:00
Jon Chesterfield	d3316bc111	[amdgpu] Delete elide-module-lds attribute Requires D155190 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D155238	2023-07-14 00:36:33 +01:00
Jon Chesterfield	74e928a081	[amdgpu][lds] Remove recalculation of LDS frame from backend Do the LDS frame calculation once, in the IR pass, instead of repeating the work in the backend. Prior to this patch: The IR lowering pass sets up a per-kernel LDS frame and annotates the variables with absolute_symbol metadata so that the assembler can build lookup tables out of it. There is a fragile association between kernel functions and named structs which is used to recompute the frame layout in the backend, with fatal_errors catching inconsistencies in the second calculation. After this patch: The IR lowering pass additionally sets a frame size attribute on kernels. The backend uses the same absolute_symbol metadata that the assembler uses to place objects within that frame size. Deleted the now dead allocation code from the backend. Left for a later cleanup: - enabling lowering for anonymous functions - removing the elide-module-lds attribute (test churn, it's not used by llc any more) - adjusting the dynamic alignment check to not use symbol names Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D155190	2023-07-13 23:54:38 +01:00
Jeffrey Byrnes	6b7805fcb1	[AMDGPU][IGLP] Add iglp_opt(1) strategy for single wave gemms This adds the IGLP strategy for single-wave gemms. The SchedGroup pipeline is laid out in multiple phases, with each phase corresponding to a distinct pattern present in gemm kernels. The resilience of the optimization is dependent upon IR (as seen by pre-RA scheduling) continuing to have these patterns (as defined by instruction class and dependencies) in their current relative ordering. The kernels of interest have these specific phases: NT: 1, 2a, 2c NN: 1, 2a, 2b TT: 1, 2b, 2c TN: 1, 2b The general approach taken was to have a long SchedGroup pipeline. In this way the scheduler will have less capability of doing the wrong thing. In order to resolve the challenge of correctly fitting these long pipelines, we leverage the rules infrastructure to help the solver. Differential Revision: https://reviews.llvm.org/D149773 Change-Id: I1a35962a95b4bdf740602b8f110d3297c6fb9d96	2023-07-13 12:03:04 -07:00
Ivan Kosarev	289ae6525d	[AMDGPU][MC] Fix handling of A16 operands in intersect_ray instructions. The patch adds the support for 'noa16' operands in non-A16 variants of the instructions, fixes validation of A16 operands and eliminates the custom conversion to MCInst. Part of <https://github.com/llvm/llvm-project/issues/62629>. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D155057	2023-07-13 19:46:03 +01:00
Mateja Marjanovic	fa46feb314	[AMDGPU] Use V_FMA_MIX* more often Combine mul (f32) + fptrunc (f32->f16) to "v_fma_mixlo_f16 mulSrc1, mulSrc2, 0". Differential Revision: https://reviews.llvm.org/D153544 Reviewers: arsenm, foad	2023-07-13 16:56:16 +02:00
Mateja Marjanovic	d3140f9363	Precommit for more usage of V_FMA/MAD_MIX* Make fdiv.f16.ll autogenerated.	2023-07-13 16:26:21 +02:00
pvanhout	07c5920487	Reland "[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64" This time without the extra `->dump()` A recent addition to the device libs, `__ockl_dm_trim`, caused a series of failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function. The quick fix for this is to support codegen for this rare case. A proper long-term fix for this type of issue is still being discussed. Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155050	2023-07-13 15:58:48 +02:00
pvanhout	aec971adec	Revert "[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64" This reverts commit cfa2d0a3aa0beb5422107dc9943cb0eae6d93896.	2023-07-13 15:52:27 +02:00
Mateja Marjanovic	701c4adcea	Check for denormal flushing when selecting V_FMA/MAD_MIX*	2023-07-13 15:26:20 +02:00
pvanhout	cfa2d0a3aa	[AMDGPU] Wave32 CodeGen for amdgcn.ballot.i64 A recent addition to the device libs, `__ockl_dm_trim`, caused a series of failures at O0 due to a i64 ballot intrinsic being inlined into a wave32 function. The quick fix for this is to support codegen for this rare case. A proper long-term fix for this type of issue is still being discussed. Fixes SWDEV-408929, SWDEV-408957, SWDEV-409885, SWDEV-410193 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155050	2023-07-13 15:20:58 +02:00
Jon Chesterfield	9418c40af7	[amdgpu][lds] Raise an explicit unimplemented error on absolute address LDS variables These aren't implemented. They could be at moderate implementation complexity. Raising an error is better than silently miscompiling. Patching now because the patch at D155125 is a step towards using this metadata more extensively as part of the lowering path and that will interact badly with input variables with this annotation. Lowering user defined variables at specific addresses would drop this error, put them at the requested position in the frame during this pass, and then use the same codegen that will be used for the kernel specific struct shortly. Reviewed By: jmmartinez Differential Revision: https://reviews.llvm.org/D155132	2023-07-13 11:32:03 +01:00
pvanhout	361e9eec51	[AMDGPU] Corrrectly emit AGPR copies in tryFoldPhiAGPR - Don't create COPY instructions between PHI nodes. - Don't create V_ACCVGPR_WRITE with operands that aren't AGPR_32 Solves SWDEV-410408 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155080	2023-07-13 08:55:22 +02:00
Jingu Kang	33e60484d7	[MachineLICM] Handle Subloops MachineLICM pass handles inner loops only when outmost loop does not have unique predecessor. If the loop has preheader and there is loop invariant code, the invariant code can be hoisted to the preheader in general. This patch makes the pass handle inner loops in general. Differential Revision: https://reviews.llvm.org/D154205	2023-07-12 16:32:14 +01:00
Ivan Kosarev	15e7749e19	[Codegen] Generate fast fp64-to-fp16 conversions in unsafe mode. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D154528	2023-07-12 11:55:19 +01:00
David Stenberg	6aa94c64a5	[DWARF] Add printout for op-index This is a preparatory patch for extending DWARFDebugLine to properly parse line number programs with maximum_operations_per_instruction > 1 for VLIW targets. Add some scaffolding for handling op-index in line number programs, and add printouts for that in the table. As this affects a lot of tests, this is done in a separate commit to get a cleaner review for the actual op-index implementation. Verbose printouts are not present in many tests, and adding op-index to those will require a bit more code changes, so that is done in the actual implementation patch. Reviewed By: StephenTozer Differential Revision: https://reviews.llvm.org/D152535	2023-07-12 12:03:44 +02:00
Jay Foad	f7684d8510	[DAG] Use legal shift amount type in DAGTypeLegalizer::JoinIntegers Documentation for TargetLowering::getShiftAmountTy says that LegalTypes should generally be true during type legalization, so this patch does that. On AMDGPU the effect is that we use i32 (a sane type) instead of i64 (pointer sized type) for more shift amounts, which in turn allows more formation of rotates and funnel shifts pre-legalization. Differential Revision: https://reviews.llvm.org/D154960	2023-07-12 08:12:09 +01:00
Jon Chesterfield	e75ce77cd7	[amdgpu][lds] Fix missing markUsedByKernel calls and undef lookup table elements More robust association between the kernels and lds struct. Use poison instead of value() for lookup table elements introduced by dynamic lds lowering. Extracted from D154946, new test from there verbatim. Segv fixed. Fixes issues/63338 Fixes SWDEV-404491 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D154972	2023-07-12 00:37:21 +01:00
Matt Arsenault	b59022b42e	DAG: Handle lowering of unordered fcZero|fcSubnormal to fcmp	2023-07-11 18:30:15 -04:00
Jon Chesterfield	980cd18354	[amdgpu][nfc] Drop lds strategy noise from some tests	2023-07-11 21:18:49 +01:00
Matt Arsenault	fbe4ff8149	AMDGPU: Partially fix not respecting dynamic denormal mode The most notable issue was producing v_mad_f32 in functions with the dynamic mode, since it just ignores the mode. fdiv lowering is still somewhat broken because it involves a mode switch and we need to query the original mode.	2023-07-11 15:14:52 -04:00
Amara Emerson	3a80bdb316	[GlobalISel] Remove an erroneous oneuse check in the G_ADD reassociation combine. This check was unnecessary/incorrect, it was already being done by the target hook default implementation, and the one in the matcher was checking for a completely different thing. This change: 1) Removes the check and updates affected tests which now do some more reassociations. 2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse check. Not sure why I didn't do this the first time.	2023-07-10 01:03:12 -07:00
Matt Arsenault	64d325454b	AMDGPU: Delete custom combine on class intrinsic This is no longer necessary as class-with-constant will always be transformed to the generic class intrinsic. https://reviews.llvm.org/D153901	2023-07-07 15:28:21 -04:00
Christudasan Devadasan	7a98f084c4	[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs Currently, the custom SGPR spill lowering pass spills SGPRs into physical VGPR lanes and the remaining VGPRs are used by regalloc for vector regclass allocation. This imposes many restrictions that we ended up with unsuccessful SGPR spilling when there won't be enough VGPRs and we are forced to spill the leftover into memory during PEI. The custom spill handling during PEI has many edge cases and often breaks the compiler time to time. This patch implements spilling SGPRs into virtual VGPR lanes. Since we now split the register allocation for SGPRs and VGPRs, the virtual registers introduced for the spill lanes would get allocated automatically in the subsequent regalloc invocation for VGPRs. Spill to virtual registers will always be successful, even in the high-pressure situations, and hence it avoids most of the edge cases during PEI. We are now left with only the custom SGPR spills during PEI for special registers like the frame pointer which is an unproblematic case. Differential Revision: https://reviews.llvm.org/D124196	2023-07-07 23:14:32 +05:30
Christudasan Devadasan	b4a62b1fa5	[AMDGPU] Enable whole wave register copy So far, we haven't exposed the allocation of whole-wave registers to regalloc. We hand-picked them for various whole wave mode operations. With a future patch, we want the allocator to efficiently allocate them rather than using the custom pre-allocation pass. Any liverange split of virtual registers involved in whole-wave operations require the resulting COPY introduced with the split to be performed for all lanes. It isn't implemented in the compiler yet. This patch would identify all such copies and manipulate the exec mask around them to enable all lanes without affecting the value of exec mask elsewhere. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D143762	2023-07-07 22:58:55 +05:30
Christudasan Devadasan	b78b36e1a2	[AMDGPU] Implement whole wave register spill To reduce the register pressure during allocation, when the allocator spills a virtual register that corresponds to a whole wave mode operation, the spill loads and restores should be activated for all lanes by temporarily flipping all bits in exec register to one just before the spills. It is not implemented in the compiler as of today and this patch enables the necessary support. This is a pre-patch before the SGPR spill to virtual VGPR lanes that would eventually causes the whole wave register spills during allocation. Reviewed By: arsenm, cdevadas Differential Revision: https://reviews.llvm.org/D143759	2023-07-07 22:51:45 +05:30
Matt Arsenault	94e24624c2	AMDGPU: Remove attempt at simplifying the format string in printf lowering This avoids computing the dominator tree by removing the simplifyInstruction use. This was applying simplification with some kind of questionable load-store forwarding and looking for the global. This had to have been an ancient hack copied from previous backends. In the OpenCL case, this is always emitted as required the direct global reference anyway.	2023-07-07 09:26:07 -04:00
Matt Arsenault	64df9573a7	DAG: Handle inversion of fcSubnormal | fcZero There are a number of more test combinations here that can be done together and reduce the number of instructions. https://reviews.llvm.org/D143191	2023-07-06 21:19:44 -04:00
Matt Arsenault	61820f8b5d	CodeGen: Optimize lowering of is.fpclass fcZero|fcSubnormal Combine the two checks into a check if the exponent bits are 0. The inverted case isn't reachable until a future change, and GlobalISel currently doesn't attempt the inversion optimization. https://reviews.llvm.org/D143182	2023-07-06 13:03:57 -04:00
Matt Arsenault	9df70e4a4d	AMDGPU: Fix not applying the correct default memcpy expansion threshold Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got renamed at some point in the process and the target implementation was left behind. Fixes: SWDEV-407329	2023-07-06 12:14:14 -04:00
Matt Arsenault	c70cae6315	AMDGPU: Make SIFixVGPRCopies preserve everything All this does is add uses of reserved registers, which aren't tracked by anything. Saves a loop info computation.	2023-07-06 10:26:21 -04:00
Matt Arsenault	8ee1cc82c9	AMDGPU: Fold out sign bit ops on frexp_exp The sign bit has no impact on the exponent, so strip these away. Saves on the source modifier encoding cost. I left the GlobalISel handling until there's a resolution to issue #62628. We should do this in instcombine too, but legalization should be introducing more frexps than it currently is where this would occur.	2023-07-06 10:26:21 -04:00
Ivan Kosarev	b4049b409b	[AMDGPU] Add GlobalISel test coverage for floating-point truncations. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D154527	2023-07-06 11:37:09 +01:00
Valery Pykhtin	98aa8439f5	[AMDGPU] Fix register class for a subreg in GCNRewritePartialRegUses. 1. Improved code that deduces register class from instruction definitions. Previously if some instruction didn't contain a reg class for an operand it was considered as no information on register class even if other instructions specified the class. 2. Added check on required size of resulting register because in some cases classes with smaller registers had been selected (for example VReg_1). Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D152832	2023-07-06 08:48:45 +02:00
Matt Arsenault	20964c901a	DAG: Fix dropping flags when widening unary vector ops	2023-07-05 17:25:24 -04:00
Matt Arsenault	5491666248	AMDGPU: Correctly lower llvm.exp.f32 The library expansion has too many paths for all the permutations of DAZ, unsafe and the 3 exp functions. It's easier to expand it in the backend when we know all of these things. The library currently misses the no-infinity check on the overflow, which this handles optimizing out. Some of the <3 x half> fast tests regress due to vector widening dropping flags which will be fixed separately. Apparently there is no exp10 intrinsic, but there should be. Adds some deadish code in preparation for adding one while I'm following along with the current library expansion.	2023-07-05 17:23:49 -04:00
Matt Arsenault	ed556a1ad5	AMDGPU: Correctly lower llvm.exp2.f32 Previously this did a fast math expansion only.	2023-07-05 17:23:48 -04:00
Matt Arsenault	9c82dc6a6b	AMDGPU: Always use v_rcp_f16 and v_rsq_f16 These inherited the fast math checks from f32, but the manual suggests these should be accurate enough for unconditional use. The definition of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've been a bit nervous about changing this as the OpenCL conformance test does not cover half. Brute force produces identical values compared to a reference host implementation for all values.	2023-07-05 16:53:01 -04:00
Matt Arsenault	59c311c5d4	AMDGPU: Add more tests for f16 fdiv lowering Probably should merge the DAG and gisel tests.	2023-07-05 16:53:01 -04:00
Matt Arsenault	4e15f378ee	AMDGPU: Correctly lower llvm.log.f32 and llvm.log10.f32 Previously we expanded these in a fast-math way and the device libraries were relying on this behavior. The libraries have a pending change to switch to the new target intrinsic. Unlike the library version, this takes advantage of no-infinities on the result overflow check.	2023-07-05 15:30:35 -04:00
Stephen Thomas	2dfb4b56fe	[AMDGPU] Fix incorrect hazard mitigation GCNHazardRecognizer::fixVcmpxExecWARHazard() mitigates a specific hazard by inserting a wait on sa_sdst==0 if such a wait isn't already present. Unfortunately, the check for an existing wait incorrectly checks for one that doesn't actually care about sa_sdst itself, but requires that no other counters are waited for. Once the check is performed correctly, a lit test needs to be updated, since it is currently testing for the incorrect behaviour. Differential Revision: https://reviews.llvm.org/D154438	2023-07-04 14:42:51 +01:00
Jay Foad	f2c164c815	[AMDGPU] Do not wait for vscnt on function entry and return SIInsertWaitcnts inserts waitcnt instructions to resolve data dependencies. The GFX10+ vscnt (VMEM store count) counter is never used in this way. It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer. Hence there is no need to conservatively wait for vscnt to be 0 on function entry and before returns. Differential Revision: https://reviews.llvm.org/D153537	2023-07-04 12:22:38 +01:00
Matt Arsenault	8f9eee3602	AMDGPU: Fix opaque pointer conversion error in test The * was in the wrong place so this was missed by the script.	2023-06-30 15:04:03 -04:00
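
Commit 61820f8b5d above describes lowering an is.fpclass test for fcZero|fcSubnormal as a single check that the exponent bits are 0. The Python sketch below illustrates that idea on IEEE-754 binary32 bit patterns only. It is not the LLVM implementation, and the names `F32_EXP_MASK`, `f32_bits` and `is_zero_or_subnormal_f32` are hypothetical helpers introduced for illustration.

```python
import struct

# Exponent field of an IEEE-754 binary32 value: bits 23..30.
F32_EXP_MASK = 0x7F800000


def f32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_zero_or_subnormal_f32(bits: int) -> bool:
    """True when the value is +/-0 or a subnormal.

    Zeros and subnormals are exactly the encodings whose exponent field
    is all zeros, so one mask-and-compare replaces two class checks.
    The sign bit and the mantissa bits are ignored.
    """
    return (bits & F32_EXP_MASK) == 0
```

Passing raw bit patterns rather than Python floats keeps the check exact, since a Python float is a binary64 value and would have to be rounded to binary32 first. Infinities and NaNs have an all-ones exponent field, so the same comparison rejects them.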


6562 Commits