llvm-project

Author	SHA1	Message	Date
Robert Imschweiler	bcba3117c0	[AMDGPU] SelDAG: fix lowering of undefined workitem intrinsics (#126058 ) GlobalISel already handles undefined workitem.id.{x,y,z} intrinsics, SelDAG failed in AMDGPUISelLowering.cpp due to a failed assertion in `AMDGPUTargetLowering::loadInputValue`: `Arg && "Attempting to load missing argument"`. This commit changes the behavior of SelDAG to instead use a zero constant. This LLVM defect was identified via the AMD Fuzzing project.	2025-02-12 18:41:41 -05:00
Jeffrey Byrnes	c5a4512d85	[AMDGPU] iglp.opt does not clobber memory operands (#126976 ) I think it was an accident that this wasn't included.	2025-02-12 14:11:02 -08:00
Akshat Oke	7b60e03d73	Reland "[CodeGen][NewPM] Port MachineScheduler to NPM. (#125703 )" (#126684 ) `RegisterClassInfo` was supposed to be kept alive between pass runs, but it was not, which led to recomputations that increased compile time. Now the Impl class is a member of the legacy and new passes so that it is not reconstructed on every pass run. --------- Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>	2025-02-12 18:54:39 +05:30
Vikram Hegde	9c725ef368	[AMDGPU][NewPM] Port "GCNRewritePartialRegUses" pass to NPM (#126024 )	2025-02-12 11:21:40 +05:30
Krzysztof Drewniak	934c97dd16	[LowerBufferFatPointers] Fix support for GEP T, p7, <N x T> idxs (#126126 ) The lowering for GEP didn't properly support the case where the pointer argument was being implicitly broadcast by a vector of indices. Fix that. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-02-11 18:22:50 -06:00
Brox Chen	ad6cd7e8b2	[AMDGPU][True16][CodeGen] true16 codegen for MadFmaMixPat (#124892 ) true16 codegen for MadFmaMixPat. GISEL test not enabled and will be added later when GISEL is supported	2025-02-11 17:36:44 -05:00
Vigneshwar Jayakumar	1188b1ff7b	AMDGPU: Handle gfx950 XDL Write-VGPR-VALU-WAW wait state change (#126132 ) There are additional wait states for XDL write VALU WAW hazard in gfx950 compared to gfx940.	2025-02-12 01:32:23 +07:00
Vigneshwar Jayakumar	a2263eba4d	AMDGPU: Handle gfx950 XDL-write-VGPR-VALU-Mem-Exp wait state change (#126727 )	2025-02-12 01:30:53 +07:00
Vigneshwar Jayakumar	c837f57286	AMDGPU: Handle gfx950 XDL-write-VGPR-Overlap-Src-AB wait state (#126732 ) gfx950 needs more wait states than gfx940	2025-02-11 22:30:16 +07:00
Juan Manuel Martinez Caamaño	dd59198647	[NFC][AMDGPU] Rename test (#126725 ) The demonte-scc transformation is no longer needed and the old test name doesn't make sense anymore. The test checks the generated assembly for different branch cases * without metadata, * with the same branch_weights on each edge and * with a branch_weights that corresponds to the [[likely]] attribute	2025-02-11 15:10:37 +01:00
Vikram Hegde	3293bff5d2	[AMDGPU][NewPM] Port "GCNPreRAOptimizations" pass to NPM (#126040 )	2025-02-11 11:09:38 +05:30
Shilei Tian	bde8ce6a5c	[AMDGPU] Only run `AMDGPUPrintfRuntimeBindingPass` at non-prelink phase (#125162 )	2025-02-10 08:24:50 -05:00
Shilei Tian	70fdd9f0a2	[GlobalISel] Check whether `G_CTLZ` is legal in `matchUMulHToLShr` (#126457 ) We need to check `G_CTLZ` because the combine uses `G_CTLZ` to get log base 2, and it is not always legal on a target. Fixes SWDEV-512440.	2025-02-10 00:11:09 -05:00
Shilei Tian	967973512b	[AMDGPU] Don't unify divergent exit nodes with `musttail` calls (#126395 ) Fixes SWDEV-512254.	2025-02-09 21:48:24 -05:00
Akshat Oke	564b9b7f4d	Revert "[CodeGen][NewPM] Port MachineScheduler to NPM. (#125703 )" (#126268 ) This reverts commit 5aa4979c47255770cac7b557f3e4a980d0131d69 while I investigate what's causing the compile-time regression.	2025-02-08 15:36:48 +05:30
Brox Chen	2b43543afb	[AMDGPU][True16][MC][CodeGen] true16 for v_alignbyte_b32 (#125706 ) Support true16 format for v_alignbyte_b32 in MC and CodeGen	2025-02-07 09:54:11 -05:00
Matt Arsenault	d21fc58aee	AMDGPU: Use default shouldRewriteCopySrc (#125535 ) This was ultimately working around bugs in subregister handling in peephole-opt. In the common case, it would give up on folding anything into a subregister extract copy.	2025-02-07 12:31:14 +07:00
Jeffrey Byrnes	16f7e961c6	[AMDGPU] Allow rematerialization of instructions with virtual register uses (#124327 ) Remove the restriction that scheduling rematerialization candidates cannot have virtual reg uses. Currently, this only allows for virtual reg uses which are already live at the rematerialization point, so bring in allUsesAvailableAt to check for this condition. Because of this condition, the uses of the remats will already be live in to the region, so the remat won't increase live-in pressure. Add an expensive check to check this condition.	2025-02-06 10:16:28 -08:00
Matt Arsenault	58a88001f3	PeepholeOpt: Fix looking for def of current copy to coalesce (#125533 ) This fixes the handling of subregister extract copies. This will allow AMDGPU to remove its implementation of shouldRewriteCopySrc, which exists as a 10 year old workaround to this bug. peephole-opt-fold-reg-sequence-subreg.mir will show the expected improvement once the custom implementation is removed. The copy coalescing processing here is overly abstracted from what's actually happening. Previously when visiting coalescable copy-like instructions, we would parse the sources one at a time and then pass the def of the root instruction into findNextSource. This means that the first thing the new ValueTracker constructed would do is getVRegDef to find the instruction we are currently processing. This adds an unnecessary step, placing a useless entry in the RewriteMap, and required skipping the no-op case where getNewSource would return the original source operand. This was a problem since in the case of a subregister extract, shouldRewriteCopySrc would always say that it is useful to rewrite and the use-def chain walk would abort, returning the original operand. Move the process to start looking at the source operand to begin with. This does not fix the confused handling in the uncoalescable copy case which is proving to be more difficult. Some currently handled cases have multiple defs from a single source, and other handled cases have 0 input operands. It would be simpler if this was implemented with isCopyLikeInstr, rather than guessing at the operand structure as it does now. There are some improvements and some regressions. The regressions appear to be downstream issues for the most part. One of the uglier regressions is in PPC, where a sequence of insert_subregs is used to build registers. I opened #125502 to use reg_sequence instead, which may help. The worst regression is an absurd SPARC testcase using a <251 x fp128>, which uses a very long chain of insert_subregs. We need improved subregister handling locally in PeepholeOptimizer, and other passes like MachineCSE to fix some of the other regressions. We should handle subregister composes and folding more indexes into insert_subreg and reg_sequence.	2025-02-05 23:29:02 +07:00
Christudasan Devadasan	b83c960bad	[CodeGen][NewPM] Port SIWholeQuadMode to NPM. (#125833 )	2025-02-05 18:44:57 +05:30
Akshat Oke	f77f777f35	[CodeGen][NewPM] Port RenameIndependentSubregs to NPM (#125192 )	2025-02-05 17:54:57 +05:30
Christudasan Devadasan	44f638f88e	[CodeGen][NewPM] Port PostRAScheduler to NPM. (#125798 )	2025-02-05 12:45:59 +05:30
Christudasan Devadasan	5aa4979c47	[CodeGen][NewPM] Port MachineScheduler to NPM. (#125703 )	2025-02-05 12:17:59 +05:30
Robert Imschweiler	21560fe6b9	GlobalISel: Fix defined register of invariant.start (#125664 ) In contrast to SelectionDAG, GlobalISel created a new virtual register for the return value of invariant.start, leaving subsequent users of the invariant.start value with an undefined reference. A minimal example: ``` %tmp = alloca i32, align 4, addrspace(5) %tmpI = call ptr @llvm.invariant.start.p5(i64 4, ptr addrspace(5) %tmp) #3 call void @llvm.invariant.end.p5(ptr %tmpI, i64 4, ptr addrspace(5) %tmp) #3 store i32 %i, ptr %tmpI, align 4 ``` Although the return value of invariant.start might not be intended for any use beyond invariant.end (the fuzzer might not have created a sensible situation here), an implicit definition of the corresponding virtual register avoids a segfault in the target instruction selector later. This LLVM defect was identified via the AMD Fuzzing project.	2025-02-04 23:59:03 +07:00
Brox Chen	5eff19f48b	[AMDGPU][True16][Codegen] true16 codegen for FPtoI1 (#125120 ) True16 codegen for FPtoi1. It seems tablegen figured out the pattern even without this pat in place, and the fptoui/fptosi.ll already got the right transformation. Additionally updated the mir file and split it to pre-gfx11 and post-gfx11.	2025-02-04 11:20:40 -05:00
Brox Chen	6515fdf73d	[AMDGPU][True16][CodeGen] true16 codegen for FPMinMax pat (#125107 ) true16 codegen for FPMinMax Pattern	2025-02-04 11:20:17 -05:00
Akshat Oke	4313345f2e	[CodeGen][NewPM] Port MachineCopyPropagation to NPM (#125202 )	2025-02-04 15:45:03 +05:30
Matt Arsenault	2f2ac3de69	DAG: Avoid stack usage in bitcast operand promotion to legal vector (#125637 ) Fix introducing stack usage if a bitcast source operand is an illegal integer type cast to a legal vector type. This should cover more situations, but this is the first one I noticed.	2025-02-04 16:43:42 +07:00
Matt Arsenault	cdca04913a	DAG: Avoid introducing stack usage in vector->int bitcast int op promotion (#125636) Avoids stack usage in the v5i32 to i160 case for AMDGPU, which appears in fat pointer lowering.	2025-02-04 16:32:47 +07:00
David Stuttard	6c560ef33e	[AMDGPU] Add .entry_point back into PAL metadata (#125505 )	2025-02-04 08:19:05 +00:00
Fabian Ritter	b95a6c750c	[AMDGPU] Remove special cases in TTI::getMemcpyLoop(Residual)LoweringType (#125507 ) These special cases limit the width of memory operations we use for lowering memcpy/memmove when the pointer arguments are 2-aligned or in the LDS/GDS. I found that performance in microbenchmarks on gfx90a, gfx1030, and gfx1100 is better without this limitation.	2025-02-04 08:18:24 +01:00
Matt Arsenault	077e0c134a	AMDGPU: Generalize truncate of shift of cast build_vector combine (#125617 ) Previously we only handled cases that looked like the high element extract of a 64-bit shift. Generalize this to handle any multiple indexing. I was hoping this would help avoid some regressions, but it did not. It does however reduce the number of steps the DAG takes to process these cases. NFC-ish, I have yet to find an example where this changes the final output.	2025-02-04 11:46:30 +07:00
Simon Pilgrim	b7c8271601	[DAG] getNode - convert scalar i1 arithmetic calls to bitwise instructions (#125486 ) We already do this for vector vXi1 types - this patch removes the vector constraint to handle it for all bool types.	2025-02-03 16:36:01 +00:00
Akshat Oke	fe9a97ca38	[CodeGen][NewPM] Port RegisterCoalescer to NPM (#124698 )	2025-02-03 13:41:51 +07:00
David Green	070e129304	[AArch64][GlobalISel] Add disjoint handling for add_and_or_is_add. (#123594 ) This allows us to easily detect, without known-bits, that the or in a fshl/fshr is disjoint allowing us to use usra under aarch64.	2025-02-02 21:01:49 +00:00
Sergei Barannikov	ff9c041d96	[MachineScheduler] Fix physreg dependencies of ExitSU (#123541 ) Providing the correct operand index allows addPhysRegDataDeps to compute the correct latency. Pull Request: https://github.com/llvm/llvm-project/pull/123541	2025-02-01 20:40:50 +03:00
Brox Chen	a51798e3d6	[AMDGPU][True16][CodeGen] true16 codegen pat for fptrunc_round (#124044 ) true16 codegen pattern for fptrunc_round f32 to f16. For the mir test, split into preGFX11 and postGFX11, and add a true16 and a fake16 test accordingly	2025-01-30 18:31:52 -05:00
Jon Chesterfield	4f358d75d0	[amdgpu][nfc] Post-commit feedback on c39fba209	2025-01-30 20:07:44 +00:00
Stanislav Mekhanoshin	8a20c6459e	[AMDGPU] Create new option for force flush load counter (#124974 ) In certain situations it is beneficial to wait for all outstanding loads, regardless of which specific load's data we need. This may reduce the number of cache requests. Fixes: SWDEV-511507	2025-01-30 11:14:38 -08:00
Brox Chen	33d401fb15	[AMDGPU][True16][CodeGen] true16 codegen for icmp and is_fpclass (#124757 ) True16 codegen pattern for icmp patterns and is_fpclass	2025-01-30 12:18:00 -05:00
Jon Chesterfield	c39fba209c	[AMDGPU] S_SET_GPR_IDX_ON can be passed an immediate index (#125086 ) Oversight found by ISel fuzz effort. Assuming the argument is a register, in some cases it can be an immediate. Tablegen's type for the instruction is SSrc_b32, i.e. register or immediate fine. Added the repro from the bug reporter as a test case - prior to this patch llvm will assert in getReg. Fixes SWDEV-508589	2025-01-30 16:40:12 +00:00
Matt Arsenault	41f76070f3	AMDGPU: Regenerate test checks	2025-01-30 22:55:33 +07:00
Matt Arsenault	d246cc618a	PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111 ) Given the rest of the pass just gives up when it needs to compose subregisters, folding a subregister extract directly into a reg_sequence is counterproductive. Later fold attempts in the function will give up on the subregister operand, preventing looking up through the reg_sequence. It may still be profitable to do these folds if we start handling the composes. There are some test regressions, but this mostly looks better.	2025-01-30 20:42:02 +07:00
Matt Arsenault	6017480461	MachineVerifier: Fix check for range type (#124894 ) We need to permit scalar extending loads with range annotations. Fix expensive_checks failures after 11db7fb09b36e656a801117d6a2492133e9c2e46	2025-01-30 10:56:12 +07:00
Matt Arsenault	97a1f494a6	DAG: Avoid breaking legal vector_shuffle with multiple uses (#123712 ) Previously this combine would undo AMDGPU's new custom legalization of wide vector shuffles into 2 element pieces. The comment also states that this combine is only done before legalization, but the case with a build_vector source was unconditional. We probably don't want to do this if the multiple uses are full scalarization of the vector, but this seems to work well enough. Scalarizing extracts should have folded out pre-legalize.	2025-01-30 10:55:21 +07:00
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Konstantina Mitropoulou	9adc99bcc5	[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. (#124028 ) - [NFC] Use GCNPat instead of Pat. - [AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. --------- Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>	2025-01-29 09:00:40 -08:00
Nikita Popov	29441e4f5f	[IR] Convert from nocapture to captures(none) (#123181 ) This PR removes the old `nocapture` attribute, replacing it with the new `captures` attribute introduced in #116990. This change is intended to be essentially NFC, replacing existing uses of `nocapture` with `captures(none)` without adding any new analysis capabilities. Making use of non-`none` values is left for a followup. Some notes: * `nocapture` will be upgraded to `captures(none)` by the bitcode reader. * `nocapture` will also be upgraded by the textual IR reader. This is to make it easier to use old IR files and somewhat reduce the test churn in this PR. * Helper APIs like `doesNotCapture()` will check for `captures(none)`. * MLIR import will convert `captures(none)` into an `llvm.nocapture` attribute. The representation in the LLVM IR dialect should be updated separately.	2025-01-29 16:56:47 +01:00
Acim Maravic	3a29dfe37c	[LLVM][AMDGPU] Add Intrinsic and Builtin for ds_bpermute_fi_b32 (#124616 )	2025-01-29 14:04:10 +01:00
David Green	66e0498daf	[GlobalISel] Do not run verifier after ResetMachineFunctionPass (#124799 ) After we fall back from GlobalISel to SDAG, the verifier gets called, which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs which caches the result of UsesAGPRs. Because we have just fallen-back the function is empty and it incorrectly gets cached to false. This patch makes sure we don't try to run the verifier whilst the function is empty.	2025-01-29 12:48:11 +00:00
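
The b7c8271601 entry above ([DAG] getNode, #125486) relies on the fact that arithmetic on a 1-bit integer wraps modulo 2, so it matches a bitwise operation. A minimal sketch of that identity follows. It is not LLVM code: the helper names are hypothetical, i1 values are modeled as Python ints 0 and 1, and the choice of add, sub, and mul is an assumption, since the commit message does not list the opcodes it handles.

```python
# Sketch only: models i1 values as the Python ints 0 and 1.
# All function names here are hypothetical and not part of LLVM.

def i1_add(a, b):
    """Wrapping 1-bit addition."""
    return (a + b) % 2

def i1_sub(a, b):
    """Wrapping 1-bit subtraction."""
    return (a - b) % 2

def i1_mul(a, b):
    """Wrapping 1-bit multiplication."""
    return (a * b) % 2

def check_i1_identities():
    """Exhaustively compare i1 arithmetic with bitwise operations."""
    for a in (0, 1):
        for b in (0, 1):
            assert i1_add(a, b) == a ^ b  # add behaves like xor
            assert i1_sub(a, b) == a ^ b  # sub behaves like xor
            assert i1_mul(a, b) == a & b  # mul behaves like and
    return True
```

Calling `check_i1_identities()` walks all four input pairs, which is the whole domain for two 1-bit operands.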

8259 Commits