llvm-project

Author	SHA1	Message	Date
Stanislav Mekhanoshin	84f398af74	[AMDGPU] Add missing test checks. NFC. (#69484)	2023-10-18 11:26:39 -07:00
Stanislav Mekhanoshin	47ed921985	[AMDGPU] Add legality check when folding short 64-bit literals (#69391) We can only fold it if it can fit into 32-bit. I believe it did not trigger yet because we do not select 64-bit literals generally.	2023-10-18 09:22:23 -07:00
Sirish Pande	28e4f97320	[AMDGPU] Save/Restore SCC bit across waterfall loop. (#68363) Waterfall loop is overwriting SCC bit of status register. Make sure SCC bit is saved and restored across. We need to save/restore only in cases where SCC is live across waterfall loop. Co-authored-by: Sirish Pande <sirish.pande@amd.com>	2023-10-18 08:43:29 -05:00
pvanhout	868abf0961	Revert "[AMDGPU] Remove Code Object V3 (#67118)" This reverts commit 544d91280c26fd5f7acd70eac4d667863562f4cc.	2023-10-18 12:55:36 +02:00
Jay Foad	104db26004	[AMDGPU] Fix image intrinsic optimizer on loads from different resources (#69355) The image intrinsic optimizer pass was neglecting to check any arguments of the load intrinsic after the VAddr arguments. For example multiple loads from different resources should not have been combined but were, because the pass was not checking the resource argument.	2023-10-18 11:08:01 +01:00
Pierre van Houtryve	c464fea779	[DAG] Constant fold FMAD (#69324) This has very little effect on codegen in practice, but is a nice to have I think. See #68315	2023-10-18 07:46:24 +02:00
Stanislav Mekhanoshin	a22a1fe151	[AMDGPU] support 64-bit immediates in SIInstrInfo::FoldImmediate (#69260) This is a part of https://github.com/llvm/llvm-project/issues/67781. Until we select more 64-bit move immediates the impact is minimal.	2023-10-17 10:53:22 -07:00
Guozhi Wei	760e7d00d1	[X86, Peephole] Enable FoldImmediate for X86 Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate. Also enhanced peephole by deleting identical instructions after FoldImmediate. Differential Revision: https://reviews.llvm.org/D151848	2023-10-17 16:22:42 +00:00
Pierre van Houtryve	cc3d2533cc	[AMDGPU] Add i1 mul patterns (#67291) i1 muls can sometimes happen after SCEV. They resulted in ISel failures because we were missing the patterns for them. Solves SWDEV-423354	2023-10-16 16:18:27 +02:00
Pierre van Houtryve	4d6fc88946	[AMDGPU] Add patterns for V_CMP_O/U (#69157) Fixes SWDEV-427162	2023-10-16 13:07:56 +02:00
Pierre van Houtryve	544d91280c	[AMDGPU] Remove Code Object V3 (#67118) V3 has been deprecated for a while as well, so it can safely be removed like V2 was removed. - [Clang] Set minimum code object version to 4 - [lld] Fix tests using code object v3 - Remove code object V3 from the AMDGPU backend, and delete or port v3 tests to v4. - Update docs to make it clear V3 can no longer be emitted.	2023-10-16 08:21:48 +02:00
Stephen Thomas	720be6c535	[AMDGPU] Add encoding/decoding support for non-result-returning ATOMIC_CSUB instructions (#68684) The BUFFER_ATOMIC_CSUB and GLOBAL_ATOMIC_CSUB instructions have encodings for non-value-returning forms, although actually using them isn't supported by hardware. However, these encodings aren't supported by the backend, meaning that they can't even be assembled or disassembled. Add support for the non-returning encodings, but gate actually using them in instruction selection behind a new feature FeatureAtomicCSubNoRtnInsts, which no target uses. This does allow the non-returning instructions to be tested manually and llvm.amdgcn.atomic.csub.ll is extended to cover them. The feature does not gate assembling or disassembling them, this is now not an error, and encoding and decoding tests have been adapted accordingly.	2023-10-11 11:37:27 +01:00
Piotr Sobczak	2888fa4313	[AMDGPU] Update test remat-smrd.mir Update test/CodeGen/AMDGPU/remat-smrd.mir: * Convert a negative case of non-dereferenceable invariant load to positive one. * Add new cases for subreg.	2023-10-11 10:19:22 +02:00
Thomas Symalla	aa5158cd1e	[AMDGPU] Use absolute relocations when compiling for AMDPAL and Mesa3D (#67791) The primary ISA-independent justification for using PC-relative addressing is that it makes code position-independent and therefore allows sharing of .text pages between processes. When not sharing .text pages, we can use absolute relocations instead, which will possibly prevent a bubble introduced by s_getpc_b64. Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>	2023-10-10 09:22:02 +02:00
Jay Foad	7b3bbd83c0	Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)" This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c. Reverted due to various buildbot failures.	2023-10-09 12:31:32 +01:00
Jay Foad	2501ae58e3	[CodeGen] Really renumber slot indexes before register allocation (#67038) PR #66334 tried to renumber slot indexes before register allocation, but the numbering was still affected by list entries for instructions which had been erased. Fix this to make the register allocator's live range length heuristics even less dependent on the history of how instructions have been added to and removed from SlotIndexes's maps.	2023-10-09 11:44:41 +01:00
Jeffrey Byrnes	6afceba510	[AMDGPU][IGLP] SingleWaveOpt: Cache DSW Counters from PreRA (#67759) Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose. This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline. Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).	2023-10-06 17:34:14 -07:00
Petar Avramovic	2fa7d652d0	AMDGPU: Fix temporal divergence introduced by machine-sink (#67456) Temporal divergence that was present in input or introduced in IR transforms, like code-sinking or LICM, is handled in SIFixSGPRCopies by changing sgpr source instr to vgpr instr. After 5b657f5, that moved LICM after AMDGPUCodeGenPrepare, machine-sinking can introduce temporal divergence by sinking instructions outside of the cycle. Add isSafeToSink callback in TargetInstrInfo.	2023-10-06 15:00:08 +02:00
Petar Avramovic	2d7fe90a3e	AMDGPU: Add test for temporal divergence introduced by machine-sink Introduced by 5b657f50b8e8dc5836fb80e566ca7569fd04c26f that moved LICM after AMDGPUCodeGenPrepare. Some instructions are no longer sunk during ir optimizations but in machine-sinking instead. If vgpr instruction used sgpr defined inside the cycle is sunk outside of the cycle we end up with not-handled case of temporal divergence. Add test for theoretical case when SALU instruction (represents uniform value) is sunk outside of the cycle. Add a test when SALU instruction can be sunk if it edits lane mask.	2023-10-06 15:00:08 +02:00
Petar Avramovic	ccf68ab432	Revert "MachineSink: Fix sinking VGPR def out of a divergent loop" This reverts commit 3f8ef57bede94445b1a1042c987cc914a886e7ff.	2023-10-06 15:00:08 +02:00
Diana Picus	2e1718adc8	Reland "AMDGPU: Duplicate instead of COPY constants from VGPR to SGPR (#66882)" Teach the si-fix-sgpr-copies pass to deal with REG_SEQUENCE, PHI or INSERT_SUBREG where the result is an SGPR, but some of the inputs are constants materialized into VGPRs. This may happen in cases where for instance several instructions use an immediate zero and SelectionDAG chooses to put it in a VGPR to satisfy all of them. This however causes the si-fix-sgpr-copies to try to switch the whole chain to VGPR and may lead to illegal VGPR-to-SGPR copies. Rematerializing the constant into an SGPR fixes the issue. This was originally reverted because it triggered an unrelated bug in PEI on one of the OpenMP buildbots. That bug has been fixed in #68299, so it should be ok to try again.	2023-10-06 10:03:50 +02:00
Diana	be382de059	[AMDGPU] Use correct operand order for shifts (#68299) In a special case in frame index elimination (when the offset is 0), we generate either a S_LSHR_B32 or a V_LSHRREV_B32 using the same code. However, they don't expect their operands in the same order - S_LSHR_B32 takes the value to be shifted first and then the shift amount, whereas V_LSHRREV_B32 has the operands reversed (hence the REV in its name). Update the code & tests to take this into account. Also remove an outdated comment (this code is definitely reachable now that non-entry functions no longer have a fixed emergency scavenge slot).	2023-10-06 09:43:04 +02:00
Matt Arsenault	5082e827c1	AMDGPU/GlobalISel: Add test for packed sub selection Mirror of the add test, I've had this lying around for a long time.	2023-10-05 10:07:57 -07:00
Matt Arsenault	b5ebf07499	AMDGPU/GlobalISel: Add global-isel run lines to shrink add/sub test	2023-10-05 10:07:57 -07:00
Matt Arsenault	2ca30eb8fd	AMDGPU/GlobalISel: Handle mubuf load/store for more types (#68268) Fixes MUBUF path for most vectors and pointers, which unblocks fixing the gfx6/7 run lines in assorted tests. Also fixes inconsistent behavior for -flat-for-global.	2023-10-05 05:36:16 -07:00
Ivan Kosarev	f04aa1f814	[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. (#68002)	2023-10-05 14:22:29 +03:00
Kirill Stoimenov	0a776996af	Revert "[DAG] Attempt shl narrowing in SimplifyDemandedBits" This reverts commit 7a8c04ef84ecdab4390b451d4c2fe17bc45a7b63.	2023-10-04 22:15:41 +00:00
Jeffrey Byrnes	7794e16b49	[AMDGPU]: Allow combining into v_dot4 Differential Revision: https://reviews.llvm.org/D155995 Change-Id: Id15d232629a32a3549b13d47bf84d7a61b28b928	2023-10-04 13:31:36 -07:00
Alex Richardson	e86d6a43f0	Regenerate test checks for tests affected by D141060	2023-10-04 10:51:35 -07:00
Alex Richardson	83c4227ab7	Auto-generate test checks for tests affected by D141060 These files had manual CHECK lines which make the diff from D141060 very difficult to review.	2023-10-04 10:51:35 -07:00
Ivan Kosarev	cf80defae2	[AMDGPU][GFX11] Do not rewrite V_FMA/FMAC_* to V_FMAAK_F16_t16 on operand legalization. (#66202) V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original instructions would have VGPR_32 operands. Switching the opcodes without updating operands' register classes leads to MachineVerifier complaining about the classes not matching instruction definitions. The problem only reveals itself of builds with expensive checks enabled because of missing -verify-machineinstrs in the test. This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for GFX11, following the second attempt in a1e38e0b8e3e, partially reverted in eaf737a4e004.	2023-10-04 12:41:46 +01:00
Simon Pilgrim	7a8c04ef84	[DAG] Attempt shl narrowing in SimplifyDemandedBits If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext. Followup to D146121 Differential Revision: https://reviews.llvm.org/D155472	2023-10-04 10:23:02 +01:00
JP Lehr	e816c89c84	Revert "InlineSpiller: Consider if all subranges are the same when avoiding redundant spills" This reverts commit d8127b2ba8a87a610851b9a462f2fc2526c36e37.	2023-10-02 06:26:33 -05:00
Matt Arsenault	d8127b2ba8	InlineSpiller: Consider if all subranges are the same when avoiding redundant spills This avoids some redundant spills of subranges, and avoids a compile failure. This greatly reduces the numbers of spills in a loop. The main range is not informative when multiple instructions are needed to fully define a register. A common scenario is a lowered reg_sequence where every subregister is sequentially defined, but each def changes the main range's value number. If we look at specific lanes at the use index, we can see the value is actually the same. In this testcase, there are a large number of materialized 64-bit constant defs which are hoisted outside of the loop by MachineLICM. These are feeding REG_SEQUENCES, which is not considered rematerializable inside the loop. After coalescing, the split constant defs produce main ranges with an apparent phi def. There's no phi def if you look at each individual subrange, and only half of the register is really redefined to a constant. Fixes: SWDEV-380865 https://reviews.llvm.org/D147079	2023-10-01 11:37:53 +03:00
Matt Arsenault	7252787dd9	RegAllocGreedy: Fix detection of lanes read by a bundle SplitKit creates questionably formed bundles of copies when it needs to copy a subset of live lanes and can't do it with a single subregister index. These are merely marked as part of a bundle, and don't start with a BUNDLE instruction. Queries for the slot index would give the first copy in the bundle, and we need to inspect the operands of all the other bundled copies. Also fix and simplify detection of read lane subsets. This causes some RISCV test regressions, but these look like accidentally beneficial splits. I don't see a subrange based reason to perform these splits. Avoids some really ugly regressions in a future patch. https://reviews.llvm.org/D146859	2023-10-01 11:37:48 +03:00
Jay Foad	6e3d2a4b38	[ISel] Fix another crash in new FMA DAG combine (#67818) Following on from D135150, this patch fixes another crash caused by this DAG combine: fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E) The combine calls ReplaceAllUsesOfValueWith to replace (fmul C, D) with (fma C, D, E). This can cause nodes to get CSEd. In D135150 the problem was that the (fma C, D, E) node got CSEd away. In this new case, the problem is that the outer fadd node gets CSEd away. To fix it we have to return SDValue(N, 0) from the combine and be careful not to add a deleted node to the worklist.	2023-09-29 17:18:23 +01:00
Nikita Popov	4251aa7a6f	[IRBuilder] Migrate most casts to folding API Migrate creation of most casts to use the FoldXYZ rather than CreateXYZ style APIs. This means that InstSimplifyFolder now works for these, which is what accounts for the AMDGPU test changes.	2023-09-29 12:40:38 +02:00
Mirko Brkušanin	2cd2445c21	[AMDGPU] Src1 of VOP3 DPP instructions can be SGPR on supported subtargets (#67461) In order to avoid duplicating every dpp pseudo opcode that has src1, we allow it for all opcodes and add manual checks on subtargets that do not support it.	2023-09-29 11:54:49 +02:00
Yashwant Singh	7ac532efc8	[AMDGPU] Introduce AMDGPU::SGPR_SPILL asm comment flag (#67091) Use this flag to give more context to implicit def comments in assembly. Reviewed on phabricator: https://reviews.llvm.org/D153754	2023-09-29 11:15:01 +05:30
Tobias Stadler	305fbc1b32	Revert "[GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND" This reverts commit 3686a0b611c65f0d7190345b8e3e73cdca9fa657. This seems to have broken some sanitizer tests: https://lab.llvm.org/buildbot/#/builders/184/builds/7721	2023-09-29 03:35:40 +02:00
Tobias Stadler	3686a0b611	[GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND The legalizer currently generates lots of G_AND artifacts. For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary. Currently these artifacts have to be removed using post-legalize combines. Omitting these artifacts at their source in the artifact combiner has a few advantages: - We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it. - The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register). - This cleans up a lot of legalizer output and even improves compilation-times. AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count). Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth. This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case. Reviewed By: aemerson Differential Revision: https://reviews.llvm.org/D159140	2023-09-29 02:11:57 +02:00
Jay Foad	c3939eb827	[AMDGPU] Fix typo in scheduler option name (#67661) Fix: -amdgpu-disable-unclustred-high-rp-reschedule Now: -amdgpu-disable-unclustered-high-rp-reschedule	2023-09-28 20:54:57 +01:00
Jay Foad	a0a06b1804	[AMDGPU] Make a check slightly more robust Previously this was relying on [[RESULT]] having been defined in an earlier function.	2023-09-28 13:09:51 +01:00
Ivan Kosarev	be8b559956	[AMDGPU] Test codegen'ing True16 additions. The GlobalISel part is to be addressed later. Differential Revision: https://reviews.llvm.org/D156106	2023-09-27 11:10:48 +01:00
Ivan Kosarev	3ff7d51eb8	[AMDGPU][True16] Pre-commit addition tests. Differential Revision: https://reviews.llvm.org/D156529	2023-09-27 10:27:33 +01:00
Jay Foad	e3d714f2cc	[AMDGPU] Add gfx1150 test coverage in trans-forwarding-hazards.mir This demonstrates that gfx1150 does not have FeatureVALUTransUseHazard.	2023-09-26 17:24:43 +01:00
Ivan Kosarev	64482d5766	[AMDGPU] Fix passing CodeGen/AMDGPU/frem.ll on gfx1150. (#67425) We would currently crash on it trying to use t16 instructions instead of fake16 ones.	2023-09-26 15:13:23 +01:00
Ivan Kosarev	287f6cdd17	[AMDGPU] Remove the support for non-True16 copies between different register sizes. Differential Revision: https://reviews.llvm.org/D156985	2023-09-26 14:46:34 +01:00
Jingu Kang	ff68e43c81	[MachineLICM] Handle Subloops It is a re-commit from reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500. Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass handle subloops with only visiting outermost loop's blocks once. Differential Revision: https://reviews.llvm.org/D154205	2023-09-26 14:25:11 +01:00
Jay Foad	d85d143ad9	[AMDGPU] New image intrinsic optimizer pass (#67151) Implement a new pass to combine multiple image_load_2dmsaa and 2darraymsaa intrinsic calls into a single image_msaa_load if: - they refer to the same vaddr except for sample_id, - they use a constant sample_id and they fall into the same group, - they have the same dmask and the number of instructions and the number of vaddr/vdata dword transfers is reduced by the combine This should be valid on all GFX11 but a hardware bug renders it unworkable on GFX11.0.* so it is only enabled for GFX11.5. Based on a patch by Rodrigo Dominguez!	2023-09-26 09:33:49 +01:00


6856 Commits