llvm-project

Author	SHA1	Message	Date
paperchalice	abde52aa66	[CodeGen][NewPM] Port `LiveIntervals` to new pass manager (#98118 ) - Add `LiveIntervalsAnalysis`. - Add `LiveIntervalsPrinterPass`. - Use `LiveIntervalsWrapperPass` in legacy pass manager. - Use `std::unique_ptr` instead of raw pointer for `LICalc`, so destructor and default move constructor can handle it correctly. This would be the last analysis required by `PHIElimination`.	2024-07-10 19:34:48 +08:00
Fabian Ritter	17316a5989	Revert "[LowerMemIntrinsics] Use correct alignment in residual loop for variable llvm.memcpy" (#98295 ) Reverts llvm/llvm-project#97998 This seems to cause a buildbot failure on clang-hip-vega20, in the HIP test-suite, need to investigate.	2024-07-10 12:16:20 +02:00
Fabian Ritter	6c84bba218	[LowerMemIntrinsics] Use correct alignment in residual loop for variable llvm.memcpy (#97998 ) Memcpy intrinsics with statically unknown loop sizes are lowered with two load/store loops: one with access widths specified by the target, and a residual loop that copies remaining bytes individually. As the residual loop operates byte-wise, its accesses are only 1-aligned. However, we currently use the alignment that is optimal for the first loop in both, which is unsound. With this patch, we use the correct alignment in the residual loop. The lowering of memcpy with a static size already handles alignments for the residual correctly.	2024-07-10 11:29:26 +02:00
Carl Ritson	7eb1a320cc	[AMDGPU] Update EXECZ retention in SIPreEmitPeephole for GFX10/12 (#97676 ) The check to maintain EXECZ branches only checks S_WAITCNT. Add handling for new waitcnt instructions in GFX10 and GFX12.	2024-07-09 14:44:31 +09:00
Manish Kausik H	69192e0193	[LegalizeDAG] Optimize CodeGen for `ISD::CTLZ_ZERO_UNDEF` (#83039 ) Previously we had the same instructions being generated for `ISD::CTLZ` and `ISD::CTLZ_ZERO_UNDEF` which did not take advantage of the fact that zero is an invalid input for `ISD::CTLZ_ZERO_UNDEF`. This commit separates codegen for the two cases to allow for the optimization for the latter case. The details of the optimization are outlined in #82075 Fixes #82075 Co-authored-by: Manish Kausik H <hmamishkausik@gmail.com>	2024-07-08 14:01:32 +01:00
Vikram Hegde	2a9607168b	[AMDGPU] Cleanup bitcast spam in atomic optimizer (#96933 )	2024-07-08 10:53:16 +05:30
Matt Arsenault	611212fc9a	AMDGPU/GlobalISel: Legalize atomicrmw fmin/fmax (#97048 ) We only handled the easy LDS case before. Handle the other address spaces with the more complicated legality logic.	2024-07-03 23:30:05 +02:00
Jeffrey Byrnes	5da7179cb3	[AMDGPU] Reland: Add IR LiveReg type-based optimization	2024-07-03 09:26:19 -07:00
Yingwei Zheng	d5c9ffd545	[SDAG] Intersect poison-generating flags after CSE (#97434 ) This patch fixes a miscompilation when `N` gets CSEed to `Existing`: ``` Existing: t5: i32 = sub nuw Constant:i32<0>, t3 N: t30: i32 = sub Constant:i32<0>, t3 ``` Fixes https://github.com/llvm/llvm-project/issues/96366.	2024-07-03 20:32:46 +08:00
Fabian Ritter	d37e7ec2c5	[LowerMemIntrinsics] Respect the volatile argument of llvm.memmove (#97545 ) So far, we ignored if a memmove intrinsic is volatile when lowering it to loops in the IR. This change generates volatile loads and stores in this case (similar to how memcpy is handled) and adds tests for volatile memmoves and memcpys.	2024-07-03 13:37:38 +02:00
Jay Foad	b76dd4edbf	[AMDGPU] Disable atomic optimization of fadd/fsub with result (#96479 ) An atomic fadd instruction like this should return %x: ; value at %ptr is %x %r = atomicrmw fadd ptr %ptr, float %y After atomic optimization, if %y is uniform, the result is calculated as %r = %x + %y * +0.0. This has a couple of problems: 1. If %y is Inf or NaN, this will return NaN instead of %x. 2. If %x is -0.0 and %y is positive, this will return +0.0 instead of -0.0. Avoid these problems by disabling the "%y is uniform" path if there are any uses of the result.	2024-07-03 11:35:51 +01:00
Alexis Engelke	bb260eb87d	[CodeGen] Only deduplicate PHIs on critical edges (#97064 ) PHIElim deduplicates identical PHI nodes to reduce the number of copies inserted. There are two cases: 1. Identical PHI nodes are in different blocks. That's the reason for this optimization; this can't be avoided at SSA-level. A necessary prerequisite for this is that the predecessors of all basic blocks (where such a PHI node could occur) are the same. This implies that all (>= 2) predecessors must have multiple successors, i.e. all edges into the block are critical edges. 2. Identical PHI nodes are in the same block. CSE can remove these. There are a few cases, however, where they still occur regardless: - expand-large-div-rem creates PHI nodes with large integers, which get lowered into one PHI per MVT. Later, some identical values (zeroes) get folded, resulting in identical PHI nodes. - peephole-opt occasionally inserts PHIs for the same value. - Some pseudo instruction emitters create redundant PHI nodes (e.g., AVR's insertShift), merging the same values more than once. In any case, this happens rarely and MachineCSE handles most cases anyway, so that PHIElim only gets to see very few of such cases (see changed test files). Currently, all PHI nodes are inserted into a DenseMap that checks equality not by pointer but by operands. This hash map is pretty expensive (hashing itself and the hash map), but only really useful in the first case. Avoid this expensive hashing most of the time by restricting it to basic blocks with only critical input edges. This improves performance for code with many PHI nodes, especially at -O0. (Note that Clang often doesn't generate PHI nodes and -O0 includes no mem2reg. Other compilers always generate PHI nodes.)	2024-07-03 11:19:05 +02:00
Jay Foad	f3a02253e9	[test] Remove immarg parameter attribute from calls (#97432 ) It is documented that immarg is only valid on intrinsic declarations, although the verifier also tolerates it on intrinsic calls. This patch updates tests that are not specifically testing the behavior of the IR parser or verifier.	2024-07-03 09:02:31 +01:00
Fabian Ritter	e1094dd889	[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU (#96185 ) In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: Ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 16 performed well in microbenchmarks; values of 8, 32, or 64 would perform similarly. The testcase updates are autogenerated by utils/update_llc_test_checks.py. See also: - PR introducing this optimization: https://reviews.llvm.org/D46477 Part of SWDEV-455845.	2024-07-03 08:32:35 +02:00
Matt Arsenault	79516ddbee	AMDGPU: Fix assert from wrong address space size assumption (#97267 ) This was assuming the source address space was at least as large as the destination of the cast. I'm not sure why this was casting to begin with; the assumption seems to be the source address space from the root addrspacecast matches the underlying object so directly check that. Fixes #97457	2024-07-02 23:18:25 +02:00
Simon Pilgrim	1f7d31e342	[AMDGPU] Regenerate srem.ll tests - more closely match the testing in sdiv.ll	2024-07-02 17:14:39 +01:00
Jay Foad	43b9888214	[AMDGPU] Use nan as the identity for atomicrmw fmax/fmin (#97411 ) atomicrmw fmax/fmin perform the same operation as llvm.maxnum/minnum which return the other operand if one operand is nan. This means that, in the presence of nan arguments, +/- inf is not an identity for these operations but nan is (at least if you don't care about nan payloads).	2024-07-02 15:45:36 +01:00
Matt Arsenault	940ea5b8c5	AMDGPU: Add some exotic truncating store tests PR#97010 is touching the legalize rules for 5 vector stores, but not all of them so check some more cases to make sure they work.	2024-07-02 16:33:36 +02:00
Shilei Tian	9a4f57ec1e	[SelectionDAG] Use `EVT::getIntegerVT` in `getBitcastedAnyExtOrTrunc` (#96658 ) `SelectionDAG::getBitcastedAnyExtOrTrunc` assumes that there is always a valid integer type corresponding to another type, which is not always true when it comes to vector type. For example, `<3 x i8>` doesn't have a corresponding integer type. Fix SWDEV-464698.	2024-07-01 15:10:57 -04:00
Matt Arsenault	bff619f910	Revert "AMDGPU: Use real copysign in fast pow (#97152 )" This reverts commit d3e7c4ce7a3d7f08cea02cba8f34c590a349688b.	2024-07-01 20:54:50 +02:00
Matt Arsenault	d3e7c4ce7a	AMDGPU: Use real copysign in fast pow (#97152 ) Previously this would introduce some codegen regressions, but those have been avoided by simplifying demanded bits on copysign operations.	2024-07-01 20:16:22 +02:00
Jeffrey Byrnes	f903e3ec77	[AMDGPU] Reset kill flags for multiple uses of SDWAInst Ops Change-Id: I8b56d86a55c397623567945a87ad2f55749680bc	2024-07-01 09:14:02 -07:00
Matt Arsenault	0d88f662ff	GlobalISel: ComputeNumSignBits from load range metadata We're missing SimplifyDemandedBits styles of optimizations, so one case differs from the DAG from not trimming the constant. The other case is an optimization we get that the DAG doesn't do to split the 64-bit shift. https://reviews.llvm.org/D138082	2024-07-01 15:26:50 +02:00
Matt Arsenault	7032076242	GlobalISel: Drop vector range metadata on bitcast lowering (#97279 ) If we are reinterpreting the type, the range metadata also needs to be converted. I believe the DAG has the same bug.	2024-07-01 15:26:09 +02:00
Matt Arsenault	8eee6d33f7	DAG: Call SimplifyDemandedBits on copysign value operand (#97180 ) So far the only cases that seem to benefit are the weird copysign with different typed inputs.	2024-07-01 12:29:11 +02:00
Matt Arsenault	db9252b115	DAG: Call SimplifyDemandedBits on fcopysign sign value (#97151 ) Math library code has quite a few places with complex bit logic that are ultimately fed into a copysign. This helps avoid some regressions in a future patch. This assumes the position in the float type, which should at least be valid for IEEE types. Not sure if we need to guard against ppc_fp128 or anything else weird. There appears to be some value in simplifying the value operand as well, but I'll address that separately.	2024-07-01 12:19:17 +02:00
Matt Arsenault	c769dc457c	AMDGPU: Add baseline test for copysign combine (#97150 ) Pre-commit tests showing we try to SimplifyDemandedBits on the sign operand.	2024-07-01 12:14:57 +02:00
Matt Arsenault	3562001007	AMDGPU: Regenerate test checks to avoid spurious diff	2024-07-01 11:10:17 +02:00
Nuno Lopes	0e6257fbc2	SSAUpdater: use poison instead of undef in phi entries for unreachable predecessors	2024-06-30 11:51:30 +01:00
Matt Arsenault	76bc071418	DAG: Fix assert when legalizing v3f16 ldexp (#97098 ) For the v3f16.v3i32 case, the v3f16 would request widening to v4f16, but the v3i32 does not require widening to be a legal type, so GetWidenedVector would fail. We need to widen the exponent vector to the same element count as the result. Fixes: SWDEV-470951	2024-06-30 08:29:20 +02:00
Vitaly Buka	3e53c97d33	Revert "[AMDGPU] Add IR LiveReg type-based optimization" (#97138 ) Part of #66838. https://lab.llvm.org/buildbot/#/builders/52/builds/404 https://lab.llvm.org/buildbot/#/builders/55/builds/358 https://lab.llvm.org/buildbot/#/builders/164/builds/518 This reverts commit ded956440739ae326a99cbaef18ce4362e972679.	2024-06-28 23:18:26 -07:00
Vigneshwar Jayakumar	d2c817df84	[AMDGPU] Fix DynLDS causing crash when LowerLDS is run at fullLTO pipeline (#96038 ) Direct mapped dynamic LDS is not lowered in the LowerLDSModule pass. Hence it is not marked with an absolute symbol. When the LowerLDS pass is rerun in LTO, compilation fails with an assert "cannot mix abs and non-abs LDVs". This patch adds an additional check for direct mapped dynLDS to skip the assert. Fixes SWDEV-454281	2024-06-28 21:05:48 -05:00
Jeffrey Byrnes	ded9564407	[AMDGPU] Add IR LiveReg type-based optimization Change-Id: Ia0d11b79b8302e79247fe193ccabc0dad2d359a0	2024-06-28 15:01:39 -07:00
Matt Arsenault	2df2373eb8	DAG/GlobalISel: Set disjoint for or in copysign lowering (#97057 ) We masked out the sign bit from one value, and the non-sign bits from the other so there should be no common bits set. No idea how to test this on the DAG path, other than scraping the debug logs. A few targets hit this path with f16 values, but the resulting i16 ors get anyext promoted and lose the disjoint flag. In the fp128 case, PPC gets further and the or loses the flag somewhere else later. Adding a haveNoCommonBits assert shows this works though.	2024-06-28 23:03:39 +02:00
Matt Arsenault	28d142a485	AMDGPU/GlobalISel: Make pk f16 atomicrmw fadd legal for gfx908 The subtarget features for these are a bit of a mess; the no return version should probably be implied by the with-return feature.	2024-06-28 11:33:42 +02:00
isuckatcs	937d79bc9d	[GlobalISel][AArch64][AMDGPU] Expand FPOWI into series of multiplication (#95217 ) SelectionDAG already converts FPOWI into a series of optimized multiplications, this patch introduces the same optimization into GlobalISel.	2024-06-28 09:57:50 +02:00
Matt Arsenault	a2a73d892a	AMDGPU: Fix no return atomicrmw fadd v2f16 selection for gfx908 (#96948 ) We previously would always expand this with a cmpxchg loop, while it should be the same conditions as the f32 case (except for the denormal concern).	2024-06-27 21:17:16 +02:00
Jay Foad	4e70720139	[AMDGPU] Add some gfx1200 test coverage	2024-06-27 14:53:59 +01:00
Matt Arsenault	4477ff6836	AMDGPU: Remove ds_fmin/ds_fmax intrinsics (#96739 ) These have been replaced with atomicrmw.	2024-06-27 15:35:24 +02:00
Jay Foad	bf536cc7db	[AMDGPU] Fix unwanted LICM/CSE of llvm.amdgcn.pops.exiting.wave.id (#96190 ) Mark both the intrinsic and the selected MachineInstr as having side effects to prevent MachineLICM and MachineCSE from moving/removing them.	2024-06-27 09:27:52 +01:00
Janek van Oirschot	17eaa23f7e	[AMDGPU] MCExpr-ify AMDGPU HSAMetadata (#94788 ) Enables MCExpr for HSAMetadata, particularly, HSAMetadata's msgpack format.	2024-06-26 16:39:08 +01:00
Vikram Hegde	35f7b60aa6	[AMDGPU] Extend permlane16, permlanex16 and permlane64 intrinsic lowering for generic types (#92725 ) These are incremental changes over #89217 , with core logic being the same. This patch along with #89217 and #91190 should get us ready to enable 64 bit optimizations in atomic optimizer.	2024-06-26 09:24:09 +05:30
Matt Arsenault	4f80f362a5	AMDGPU: Add new metadata and expand atomicrmw fadd expansion tests	2024-06-25 23:42:48 +02:00
Matt Arsenault	8bba070ef8	AMDGPU: Expand testing of atomicrmw fmin/fmax lowering Cover amdgpu.no.fine.grained.memory vs. amdgpu.no.remote.memory.	2024-06-25 23:42:48 +02:00
Jay Foad	aaf50bf34f	[AMDGPU] Disallow negative s_load offsets in isLegalAddressingMode (#91327 )	2024-06-25 17:43:00 +01:00
Matt Arsenault	889f3c5741	AMDGPU: Handle legal v2bf16 atomicrmw fadd for gfx12 (#95930 ) Annoyingly gfx90a/940 support this for global/flat but not buffer.	2024-06-25 17:45:34 +02:00
Vikram Hegde	5feb32ba92	[AMDGPU] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (#89217 ) This patch is intended to be the first of a series with end goal to adapt atomic optimizer pass to support i64 and f64 operations (along with removing all unnecessary bitcasts). This legalizes 64 bit readlane, writelane and readfirstlane ops pre-ISel --------- Co-authored-by: vikramRH <vikhegde@amd.com>	2024-06-25 14:35:19 +05:30
vangthao95	3aef525aa4	[AMDGPU] Fix negative immediate offset for unbuffered smem loads (#89165 ) For unbuffered smem loads, it is illegal for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative. New PR of https://github.com/llvm/llvm-project/pull/79553.	2024-06-24 14:18:23 -07:00
Mariusz Sikora	689c5c4829	[AMDGPU] Set total VGPRs to 1536 for gfx12 (#96272 ) - Use Feature1_5xVGPRs	2024-06-24 13:26:03 +02:00
vg0204	c2fc7f75f6	Revert "[AMDGPU]Optimize SGPR spills (#93668 )" This reverts commit 4b9112e88a998ce620e4683548f2afd17cc5fe95. A separate issue (#96353) has been opened to track it.	2024-06-24 12:36:36 +05:30

7565 Commits