llvm-project

Author	SHA1	Message	Date
Piotr Sobczak	fac093dd08	[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-13 13:52:40 +01:00
Pierre van Houtryve	8a66510fa7	[AMDGPU] Don't create mulhi_24 in CGP (#72983) Instead, create a mul24 with a 64 bit result and let ISel take care of it. This allows patterns to simply match mul24 even for 64-bit muls instead of having to match both mul/mulhi and a buildvector/bitconvert/etc.	2023-11-30 08:26:45 +01:00
Matt Arsenault	231aa0f212	AMDGPU: Avoid creating vector extracts if we aren't going to do anything Try to avoid expensive checks failures from reporting no changes when some dead instructions were introduced.	2023-09-13 09:45:34 +03:00
Matt Arsenault	72a7024add	AMDGPU: Correctly lower llvm.sqrt.f32 Make codegen emit correctly rounded sqrt by default. Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare based on !fpmath, like the fdiv case. Hack around visitation ordering problems from AMDGPUCodeGenPrepare using forward iteration instead of a well behaved combiner. https://reviews.llvm.org/D158129	2023-09-12 23:22:54 +03:00
Matt Arsenault	6012fed6f5	AMDGPU: Fix sqrt fast math flags spreading to fdiv fast math flags This was working around the lack of operator| on FastMathFlags. We have that now which revealed the bug.	2023-08-30 11:53:05 -04:00
Matt Arsenault	a738bdf35e	AMDGPU: Permit more rsq formation in AMDGPUCodeGenPrepare We were basing the decision to defer the fast case to codegen on the fdiv itself, and not looking for a foldable sqrt input. https://reviews.llvm.org/D158127	2023-08-23 20:06:50 -04:00
pvanhout	b7503ae8a0	[AMDGPU] Clear BreakPhiNodesCache in-between functions Otherwise stale pointers pollute the cache and when a dead PHI's memory is reused for another PHI, we can get a false positive hit in the cache. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D157711	2023-08-11 15:23:41 +02:00
pvanhout	62ea799e6c	[AMDGPU] Break Large PHIs: Take whole PHI chains into account Previous heuristics had a big flaw: they only looked at a single PHI at a time, and didn't take into account the whole "chain". The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector. We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations. The solution I found is to consider the whole "PHI chain" when looking at a PHI. That is, we recursively look at the PHI's incoming values & users for other PHIs, then make a decision about the chain as a whole. The current threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value. In simple terms, two-thirds (rounded up) of the PHIs should be breakable. This seems to work well. A lower threshold such as 50% is too aggressive because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those cases often causes performance issues. Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D156414	2023-08-03 16:41:11 +02:00
Kazu Hirata	03612b2c1a	[AMDGPU] Fix an unused variable warning This patch fixes: llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1006:9: error: unused variable 'Ty' [-Werror,-Wunused-variable]	2023-07-21 16:14:41 -07:00
Matt Arsenault	6398b687c5	AMDGPU: Fix variables only used in asserts	2023-07-21 18:55:42 -04:00
Matt Arsenault	8406c3568a	AMDGPU: Implement new 2ulp fdiv lowering Extends the new frexp scaled reciprocal to the general case. The reciprocal case is just the same thing when frexp of 1 is constant folded. Could probably clean up the code to rely on that constant folding. Improves results for the IEEE path for the default OpenCL division. We used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy threshold with DAZ, which uses explicit range checks. This gives us a better fast option with the default IEEE behavior.	2023-07-21 18:55:42 -04:00
Matt Arsenault	6699c37028	AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling NFC-ish. Does trigger some reordering of the fdiv scalarization. Also skips scalarizing in more cases where nothing was going to happen. We can still scalarize in some no-op edge cases. https://reviews.llvm.org/D155740	2023-07-21 18:55:42 -04:00
Matt Arsenault	8287f3af9d	AMDGPU: Overhaul and improve rcp and rsq f32 formation The highlight change is a new denormal safe 1ulp lowering which uses rcp after using frexp to perform input scaling. This saves 2 instructions compared to other implementations which performed an explicit denormal range change. This improves the OpenCL default, and requires a flag for HIP. I don't believe there's any flag wired up for OpenMP to emit the necessary fpmath metadata. This provides several improvements and changes that were hard to separate without regressing one case or another. Disturbingly the OpenCL conformance test seems to have the reciprocal test commented out. I locally hacked it back in to test this. Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like the rcp case, we could do this in codegen if !fpmath were preserved (although we would lose some computeKnownFPClass tricks). Start requiring contract flags to form rsq. The rsq fusion actually improves the result from ~2ulp to ~1ulp. We have some older fusion in codegen which only keys off unsafe math which should be refined. Expand rsq patterns by checking for denormal inputs and pre/post multiplying like the current library code does. We also take advantage of computeKnownFPClass to avoid the scaling when we can statically prove the input cannot be a denormal. We could do the same for the rcp case, but unlike rsq a large input can underflow to denormal. We need additional upper bound exponent checks on the input in order to do the same for rcp. This rsq handling also now starts handling the negated case. We introduce rsq with an fneg. In the case the fneg doesn't fold into its user, it's a neutral change but provides improvement if it is foldable as a source modifier. Also starts respecting the arcp attribute properly, and more strictly interprets afn. We were previously interpreting afn as implying you could do the reciprocal expansion of an fdiv. The codegen handling of these also needs to be revisited. This also effectively introduces the optimization combineRepeatedFPDivisors enables, just done in the IR instead (and only for f32). This is almost across the board better. The one minor regression is for gfx6/buggy frexp case where for multiple reciprocals, we could previously reuse rematerialized constants per instance (it's neutral for a single rcp). The fdiv.fast and sqrt handling need to be revisited next. https://reviews.llvm.org/D155593	2023-07-21 16:35:53 -04:00
Matt Arsenault	d33ab05467	AMDGPU: Add flag to disable fdiv processing in IR pass We kind of have to have multiple implementations of fdiv split between the two selectors with some pre-processing. Add yet another test to check for consistency of interpretation of flag combinations. We have quite a bit of test redundancy here already, but there are so many possible interesting permutations it's unwieldy to cover every detail in any one of them. We have a number of overlapping fdiv tests but it's hard to follow everything going on as it is.	2023-07-20 19:51:15 -04:00
pvanhout	e5296c52e5	[AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHis The previous heuristic rejected a PHI if one of its users was an unbreakable PHI, no matter what the other users were. This worked well in most cases, but there's one case in rocRAND where it doesn't work. In that case, a PHI node has 2 PHI users where one is breakable but not the other. When that PHI node isn't broken, performance falls by 35%. Relaxing the restriction to "require that half of the PHI node users are breakable" fixes the issue, and seems like a sensible change. Solves SWDEV-409648, SWDEV-398393 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155184	2023-07-14 09:02:51 +02:00
Matt Arsenault	fbe4ff8149	AMDGPU: Partially fix not respecting dynamic denormal mode The most notable issue was producing v_mad_f32 in functions with the dynamic mode, since it just ignores the mode. fdiv lowering is still somewhat broken because it involves a mode switch and we need to query the original mode.	2023-07-11 15:14:52 -04:00
Matt Arsenault	64d325454b	AMDGPU: Delete custom combine on class intrinsic This is no longer necessary as class-with-constant will always be transformed to the generic class intrinsic. https://reviews.llvm.org/D153901	2023-07-07 15:28:21 -04:00
Matt Arsenault	9c82dc6a6b	AMDGPU: Always use v_rcp_f16 and v_rsq_f16 These inherited the fast math checks from f32, but the manual suggests these should be accurate enough for unconditional use. The definition of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've been a bit nervous about changing this as the OpenCL conformance test does not cover half. Brute force produces identical values compared to a reference host implementation for all values.	2023-07-05 16:53:01 -04:00
Anshil Gandhi	a22ef958cb	[AMDGPUCodegenPrepare] Add NewPM Support Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D151241	2023-05-26 00:20:01 -06:00
pvanhout	fa87dd52d4	[AMDGPU] Handle multiple occurrences of an incoming value in break large PHIs We naively broke all incoming values, assuming they'd be unique. However it's not illegal to have multiple occurrences of, e.g. `[BB0, V0]` in a PHI node. What's illegal though is having the same basic block multiple times but with different values, and it's exactly what the transform caused. This broke in some rare applications where the pattern arose. Now we cache the `BasicBlock, Value` pairs we're breaking so we can reuse the values and preserve this invariant. Solves SWDEV-399460 Reviewed By: #amdgpu, rovka Differential Revision: https://reviews.llvm.org/D151069	2023-05-22 13:40:26 +02:00
Matt Arsenault	0d0ed9a355	AMDGPU: Pattern match fract instructions in AMDGPUCodeGenPrepare This will allow eliminating the intrinsic uses in the device libraries, which will remove a subtarget dependency on the f16 version of the intrinsic. We previously had some wrong patterns for this under unsafe math which I've removed. Do it in IR partially to take advantage of the much better isKnownNeverNaN handling, and partially out of laziness to avoid repeating this in the DAG and GlobalISel path. Plus I think this should be done much earlier. Ideally this would be in InstCombine, but you can't introduce target intrinsics from a generic instruction rooted pattern.	2023-05-18 23:29:47 +01:00
pvanhout	52a2d07bb3	[AMDGPU] Improve PHI-breaking heuristics in CGP D147786 made the transform more conservative by adding heuristics, which was a good idea. However, the transform got a bit too conservative at times. This caused a surprise in some rocRAND benchmarks because D143731 greatly helped a few of them. For instance, a few xorwow-uniform tests saw a +30% boost in performance after that pass, which was lost when D147786 landed. This patch is an attempt at reaching a middleground that makes the pass a bit more permissive. It continues in the same spirit as D147786 but does the following changes: - PHI users of a PHI node are now recursively checked. When loops are encountered, we consider the PHIs non-breakable. (Considering them breakable had very negative effect in one app I tested) - `shufflevector` is now considered interesting, given that it satisfies a few trivial checks. Reviewed By: arsenm, #amdgpu, jmmartinez Differential Revision: https://reviews.llvm.org/D150266	2023-05-15 09:16:22 +02:00
Matt Arsenault	6a0d0711ce	AMDGPU: Don't try to create pointer bitcasts in load widening	2023-04-29 10:04:33 -04:00
pvanhout	b3b3cb2d2f	[AMDGPU] Less aggressively break large PHIs In some cases, breaking large PHIs can very negatively affect performance (3x more instructions observed in a particular test case). This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases. e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason. Fixes SWDEV-392803 Fixes SWDEV-393781 Fixes SWDEV-394228 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D147786	2023-04-14 15:41:26 +02:00
pvanhout	d892521076	[AMDGPU] Break-up large PHIs for DAGISel DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen. This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases. This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users. It's also only enabled for DAGISel, as GlobalISel handles PHIs much better (as it works on the whole function). This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs. Fixes SWDEV-321581 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D143731	2023-03-28 09:38:47 +02:00
pvanhout	dbebebf6f6	[AMDGPU] Use UniformityAnalysis in CodeGenPrepare A little extra change was needed in UA because it didn't consider InvokeInst and it made call-constexpr.ll assert. Reviewed By: sameerds, arsenm Differential Revision: https://reviews.llvm.org/D145358	2023-03-06 13:26:51 +01:00
Jay Foad	dcb834843e	[AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC. This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies future changes to make some of it depend on the subtarget. Differential Revision: https://reviews.llvm.org/D144650	2023-02-23 16:38:15 +00:00
Kazu Hirata	64dad4ba9a	Use llvm::bit_cast (NFC)	2023-02-14 01:22:12 -08:00
Jay Foad	6443c0ee02	[AMDGPU] Stop using make_pair and make_tuple. NFC. C++17 allows us to call constructors pair and tuple instead of helper functions make_pair and make_tuple. Differential Revision: https://reviews.llvm.org/D139828	2022-12-14 13:22:26 +00:00
Matt Arsenault	3830e4e58c	AMDGPU: Create poison values instead of undef These placeholders don't care about the finer points on the difference between the two.	2022-11-16 14:47:24 -08:00
Matt Arsenault	838fd611b7	AMDGPU: Fix assertion on <1 x i16> vectors Fixes issue 58331.	2022-10-12 17:25:24 -07:00
Nikita Popov	8e70258b18	[AMDGPUCodeGenPrepare] Check result of ConstantFoldBinaryOpOperands() This function will become fallible once we don't support constant expressions for all binops, so make sure to check the result.	2022-07-04 14:20:23 +02:00
Sebastian Neubauer	6527b2a4d5	[AMDGPU][NFC] Fix typos Fix some typos in the amdgpu backend. Differential Revision: https://reviews.llvm.org/D119235	2022-02-18 15:05:21 +01:00
Craig Topper	cbcbbd6ac8	[ValueTracking][SelectionDAG] Rename ComputeMinSignedBits->ComputeMaxSignificantBits. NFC This function returns an upper bound on the number of bits needed to represent the signed value. Use "Max" to match similar functions in KnownBits like countMaxActiveBits. Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old name around to keep this patch size down. Will do a bulk rename as follow up. Rename KnownBits::countMaxSignedBits->countMaxSignificantBits. Reviewed By: lebedev.ri, RKSimon, spatel Differential Revision: https://reviews.llvm.org/D116522	2022-01-03 11:33:30 -08:00
Craig Topper	361216f3c4	[AMDGPU] Use ComputeMinSignedBits and KnownBits::countMaxActiveBits to simplify some code. NFC Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D116516	2022-01-03 10:09:51 -08:00
Jay Foad	21a1d4cf71	[AMDGPU] Change numBitsSigned for simplicity and document it. NFC. Change numBitsSigned to return the minimum size of a signed integer that can hold the value. This is different by one from the previous result but is more consistent with numBitsUnsigned. Update all callers. All callers are now more consistent between the signed and unsigned cases, and some callers get simpler, especially the ones that deal with quantities like numBitsSigned(LHS) + numBitsSigned(RHS). Differential Revision: https://reviews.llvm.org/D112813	2021-10-29 14:22:06 +01:00
Abinav Puthan Purayil	781dd39b7b	[AMDGPU] Enable 48-bit mul in AMDGPUCodeGenPrepare. We were bailing out of creating 24-bit muls for results wider than 32 bits in AMDGPUCodeGenPrepare. With the 24-bit mulhi intrinsic, this change teaches AMDGPUCodeGenPrepare to generate the 48-bit mul correctly. Differential Revision: https://reviews.llvm.org/D112395	2021-10-26 18:53:07 +05:30
Abinav Puthan Purayil	de3038400b	[AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare::replaceMulWithMul24(). The isU24() and isI24() calls numBits to make its decision. This change replaces them with the internal numBits call so that we can use its result for the > 32 bit width cases. Differential Revision: https://reviews.llvm.org/D111864	2021-10-15 19:49:44 +05:30
Abinav Puthan Purayil	0379263f23	[AMDGPU] Fix width check for signed mul24 generation. This change fixes a case in which the highest set bit of the original result is at bit 31 and sign-extending the mul24 for it would make the result negative. Differential Revision: https://reviews.llvm.org/D111823	2021-10-15 18:53:41 +05:30
Abinav Puthan Purayil	b3c9d84e5a	[AMDGPU] Fix 24-bit mul intrinsic generation for > 32-bit result. The 24-bit mul intrinsics yield the low-order 32 bits. We should only do the transformation if the operands are known to be not wider than 24 bits and the result is known to be not wider than 32 bits. Differential Revision: https://reviews.llvm.org/D111523	2021-10-14 09:00:19 +05:30
Jacob Lambert	dc6e8dfdfe	[AMDGPU][NFC] Correct typos in lib/Target/AMDGPU/AMDGPU*.cpp files. Test commit for new contributor.	2021-09-20 14:48:50 -07:00
Jay Foad	477b9bc9f7	[AMDGPU] Minor cleanup after D109483. NFC.	2021-09-13 10:27:15 +01:00
Anshil Gandhi	2e5dc4a1ef	[AMDGPU] [CodeGen] Fold negate llvm.amdgcn.class into test mask Implemented the transformation of xor (llvm.amdgcn.class x, mask), -1 into llvm.amdgcn.class(x, ~mask). Added LIT tests as well. Differential Revision: https://reviews.llvm.org/D104049	2021-06-18 13:04:12 -06:00
Nikita Popov	9914200393	[CodeGen] Add missing includes (NFC) These currently rely on the IRBuilder.h include in TargetLowering.h. Make them explicit.	2021-06-06 15:48:27 +02:00
Serge Guelton	d6de1e1a71	Normalize interaction with boolean attributes Such attributes can either be unset, or set to "true" or "false" (as string). Throughout the codebase, this led to inelegant checks ranging from if (Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true") to if (Fn->hasAttribute("no-jump-tables") && Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true") Introduce a getValueAsBool that normalizes the check, with the following behavior: no attributes or attribute set to "false" => return false attribute set to "true" => return true Differential Revision: https://reviews.llvm.org/D99299	2021-04-17 08:17:33 +02:00
Matt Arsenault	2a0db8d70e	AMDGPU: Use more accurate fast f64 fdiv A raw v_rcp_f64 isn't accurate enough, so start applying correction.	2021-01-21 10:51:36 -05:00
dfukalov	560d7e0411	[NFC][AMDGPU] Split AMDGPUSubtarget.h to R600 and GCN subtargets ... to reduce headers dependency. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D95036	2021-01-20 22:22:45 +03:00
dfukalov	6a87e9b08b	[NFC][AMDGPU] Reduce include files dependency. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D93813	2021-01-07 22:22:05 +03:00
Simon Pilgrim	1673a08044	SelectionDAG.h - remove unnecessary FunctionLoweringInfo.h include. NFCI. Use forward declarations and move the include down to dependent files that actually use it. This also exposes a number of implicit dependencies on KnownBits.h	2020-09-03 18:33:25 +01:00
Matt Arsenault	75e6f0b3d4	AMDGPU: Add flag to disable promotion of uniform i16 ops This interferes with GlobalISel's much better handling of the situation. This should really be disabled for GlobalISel. However, the fallback only re-runs the selection passes, and doesn't go back and rerun any codegen IR passes. I haven't come up with a good solution to this problem.	2020-08-24 14:39:27 -04:00
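
The two-thirds rule described in commit 62ea799e6c can be illustrated with a small standalone sketch. This is not the code from AMDGPUCodeGenPrepare.cpp; the names `breakableThreshold` and `shouldBreakChain` are hypothetical and exist only to show the arithmetic of `ceil(chain size * (2/3))` on plain integers.

```cpp
#include <cstddef>

// Hypothetical helper (not an LLVM API): the minimum number of PHIs in a
// chain that must be breakable, i.e. ceil(ChainSize * 2 / 3), computed with
// integer arithmetic to avoid floating-point rounding.
constexpr std::size_t breakableThreshold(std::size_t ChainSize) {
  return (2 * ChainSize + 2) / 3;
}

// Hypothetical predicate: break the chain only if enough of its PHIs have at
// least one interesting incoming value.
constexpr bool shouldBreakChain(std::size_t ChainSize,
                                std::size_t NumBreakable) {
  return NumBreakable >= breakableThreshold(ChainSize);
}
```

For the chain sizes the commit message mentions, this gives a threshold of 5 for a chain of 7 PHIs and 6 for a chain of 9.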

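
The width condition stated in commit b3c9d84e5a (operands no wider than 24 bits, result no wider than 32 bits) can be sketched on concrete unsigned values as follows. This is an assumption-laden illustration: the real pass reasons about known bits of IR values, whereas the hypothetical helpers below operate on plain integers and bound the result width by the sum of the operand widths. It also reflects the state at that commit only; later commits in the table extend the handling to wider results.

```cpp
#include <cstdint>

// Illustrative stand-in (an assumption, not the pass's helper): number of
// bits needed to represent V as an unsigned integer, 0 for V == 0.
inline unsigned bitWidthUnsigned(std::uint64_t V) {
  unsigned Bits = 0;
  while (V != 0) {
    ++Bits;
    V >>= 1;
  }
  return Bits;
}

// Hypothetical check: an unsigned 24-bit multiply is usable only if both
// operands fit in 24 bits and the product is guaranteed to fit in 32 bits.
// The product of an L-bit and an R-bit value needs at most L + R bits.
inline bool canUseMulU24(std::uint64_t LHS, std::uint64_t RHS) {
  unsigned L = bitWidthUnsigned(LHS);
  unsigned R = bitWidthUnsigned(RHS);
  return L <= 24 && R <= 24 && L + R <= 32;
}
```

Two 16-bit operands pass the check, while two full 24-bit operands do not, because their product may need up to 48 bits and the intrinsic yields only the low-order 32.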

115 Commits