llvm-project

Author	SHA1	Message	Date
Matt Arsenault	54bda79335	AMDGPU: Simplify and improve sincos matching The first trivial example I tried failed to merge due to the user scan logic. Remove the complicated scan of users handling with distance thresholds, with a same block restriction. The actual expansion of sincos is basically the same size as sin or cos individually. Copy the technique the generic optimization uses, which is to just use the input instruction as the insert point or just insert at the start of the entry block. https://reviews.llvm.org/D156706	2023-08-02 17:48:35 -04:00
Matt Arsenault	b953155b49	AMDGPU: Fix counting debug instructions in execz skip threshold	2023-08-02 08:09:41 -04:00
Mirko Brkusanin	acdc503d6c	[AMDGPU][GlobalISel] Update applyMappingImpl for G_ABS and type v2s16 For G_ABS with type v2s16 and sgpr inputs break down into two s32 G_ABS instructions. Patch by: Acim Maravic Differential Revision: https://reviews.llvm.org/D155867	2023-08-02 12:27:06 +02:00
Mirko Brkusanin	fadf3e7f2b	[AMDGPU][GlobalISel] Update legalizer for G_ABS, G_SMIN, G_SMAX, G_UMIN, G_UMAX There is no need to increase the size of odd sized vectors if they are going to be scalarized by a different rule. Patch by: Acim Maravic Differential Revision: https://reviews.llvm.org/D155865	2023-08-02 12:18:18 +02:00
Jay Foad	c2093b8504	[AMDGPU] Add target features for GDS and GWS GFX9 subtargets from GFX90A onwards lack GDS but still have GWS. Differential Revision: https://reviews.llvm.org/D156713	2023-08-02 09:02:07 +01:00
Matt Arsenault	5dfdd3494b	AMDGPU: Don't try to fold wavefrontsize intrinsic in libcall simplify It's not a libcall so doesn't really belong here to begin with. Relying on checking the target name and explicit features isn't particularly sound either. The library doesn't use the intrinsic anymore, so it doesn't matter anyway.	2023-08-01 18:20:50 -04:00
Matt Arsenault	eb00555c16	AMDGPU: Add more tests for sincos recognition These show both broken cases and cases which are handled too conservatively.	2023-08-01 18:20:50 -04:00
Matt Arsenault	4d42e8b5d1	Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting" This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd. The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work around the underlying problem with SUBREG_TO_REG.	2023-07-31 20:15:45 -04:00
Matt Arsenault	5b5bd81b71	AMDGPU: Move placement of RemoveIncompatibleFunctions This should be approximately first and run with other module passes. https://reviews.llvm.org/D155987	2023-07-31 19:22:04 -04:00
Matt Arsenault	db4d6ef9ef	AMDGPU: Directly emit fabs intrinsic instead of new libcall	2023-07-31 19:19:56 -04:00
Matt Arsenault	02a0b11331	AMDGPU: Remove weird usage of implicit operand on COPY For the purpose of the test it works as well to have a use after the copy itself.	2023-07-31 19:16:11 -04:00
Matt Arsenault	0aa439d502	AMDGPU/GlobalISel: Use SGPR results for G_AMDGPU_WAVE_ADDRESS	2023-07-31 19:16:11 -04:00
Matt Arsenault	8a677a7ff0	AMDGPU: Partially respect nobuiltin in libcall simplifier There are more contexts where it's not handled correctly but this is the simplest one. https://reviews.llvm.org/D156682	2023-07-31 10:56:46 -04:00
Sameer Sahasrabuddhe	d9847cde48	[GlobalISel] convergent intrinsics Introduced the convergent equivalent of the existing G_INTRINSIC opcodes: - G_INTRINSIC_CONVERGENT - G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS Out of the targets that currently have some support for GlobalISel, the patch assumes that the convergent intrinsics are only relevant to SPIRV and AMDGPU. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D154766	2023-07-31 12:15:39 +05:30
Jay Foad	e2e3f06813	Revert "[MachineScheduler] Track physical register dependencies per-regunit" This reverts commit 1a54671d5405a39de362e9692ce963c0638023bc. It was causing lit test failures in a LLVM_ENABLE_EXPENSIVE_CHECKS build.	2023-07-29 18:05:25 +01:00
Jay Foad	1a54671d54	[MachineScheduler] Track physical register dependencies per-regunit Change the scheduler's physical register dependency tracking from registers-and-their-aliases to regunits. This has a couple of advantages when subregisters are used: - The dependency tracking is more accurate and creates fewer useless edges in the dependency graph. An AMDGPU example, edited for clarity: SU(0): $vgpr1 = V_MOV_B32 $sgpr0 SU(1): $vgpr1 = V_ADDC_U32 0, $vgpr1 SU(2): $vgpr0_vgpr1 = FLAT_LOAD_DWORDX2 $vgpr0_vgpr1, 0, 0 There is a data dependency on $vgpr1 from SU(0) to SU(1) and from SU(1) to SU(2). But the old dependency tracking code also added a useless edge from SU(0) to SU(2) because it thought that SU(0)'s def of $vgpr1 aliased with SU(2)'s use of $vgpr0_vgpr1. - On targets like AMDGPU that make heavy use of subregisters, each register can have a huge number of aliases - it can be quadratic in the size of the largest defined register tuple. There is a much lower bound on the number of regunits per register, so iterating over regunits is faster than iterating over aliases. The LLVM compile-time tracker shows a tiny overall improvement of 0.03% on X86. I expect a larger compile-time improvement on targets like AMDGPU. Differential Revision: https://reviews.llvm.org/D156552	2023-07-29 15:34:53 +01:00
Jay Foad	5a64c89c8d	[MachineScheduler] Test case for physical register dependencies Differential Revision: https://reviews.llvm.org/D156551	2023-07-29 15:34:53 +01:00
Matt Arsenault	3240ae7034	AMDGPU/GlobalISel: Set dead on scc on manually selected instructions In SelectionDAG InstrEmitter automatically puts dead flags on unused physreg defs everywhere. The generated selectors should also set dead on physreg defs that were not used in the pattern.	2023-07-28 14:14:06 -04:00
Jeffrey Byrnes	391249d1af	[AMDGPU] Allow 8,16 bit sources in calculateSrcByte This is required for many trees produced in practice for i8 CodeGen. Differential Revision: https://reviews.llvm.org/D155864 Change-Id: Iac01d183d9998b15138bdc7a5051e3bed338e7d9	2023-07-28 09:50:21 -07:00
Matt Arsenault	95e5a461f5	AMDGPU: Always custom lower extract_subvector The patterns were ripped out in a4a3ac10cb1a40ccebed4e81cd7e94f1eb71602d so this always needs to be custom lowered. I absolutely hate how difficult it is to write tests for these, I have no doubt there are more of these hidden. Fixes #64142	2023-07-27 08:46:44 -04:00
Vitaly Buka	a496c8be6e	Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting" And dependent commits. Details in D150388. This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c. This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e. This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826. This reverts commit b7836d856206ec39509d42529f958c920368166b. No conflicts in the code, few tests had conflicts in autogenerated CHECKs: llvm/test/CodeGen/Thumb2/mve-float32regloops.ll llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll Reviewed By: alexfh Differential Revision: https://reviews.llvm.org/D156381	2023-07-26 22:13:32 -07:00
Pravin Jagtap	1462053608	[AMDGPU] Propagate constants for llvm.amdgcn.wave.reduce.umin/umax Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D156077	2023-07-26 23:46:01 -04:00
pvanhout	a8aabba587	[AMDGPU] Fix PromoteAlloca Subvector Stores for Single Elements The previous condition was incorrect in some cases, like storing <2 x i32> into a double. If IndexVal was >0, we ended up never storing anything. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D156308	2023-07-26 13:21:21 +02:00
pvanhout	6a767fbc36	[AMDGPU] Precommit tests for D156308 Also includes another testcase that's unrelated, it's just a sanity check. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D156309	2023-07-26 13:21:20 +02:00
Corbin Robeck	7a4968b5a3	[AMDGPU] Add dynamic stack bit info to kernel-resource-usage Rpass output In code object 5 (https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata) the AMDGPU backend added the .uses_dynamic_stack bit to the kernel metadata to identify kernels which have compile time indeterminable stack usage (indirect function calls and recursion mainly). This patch adds this information to the output of the kernel-resource-usage remarks. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D156040 Author: Corbin Robeck <corbin.robeck@amd.com>	2023-07-25 12:20:13 -07:00
Kevin P. Neal	76c22b18ea	[FPEnv][AMDGPU] Correct strictfp tests. Correct AMDGPU strictfp tests to follow the rules documented in the LangRef: https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics Mostly these tests just needed the strictfp attribute on function definitions. I've also removed the strictfp attribute from uses of the constrained intrinsics because it comes by default since D154991, but I only did this in tests I was changing anyway. I also removed attributes added to declare lines of intrinsics. The attributes of intrinsics cannot be changed in a test so I eliminated attempts to do so. Test changes verified with D146845.	2023-07-25 13:24:46 -04:00
Matt Arsenault	e3fd8f83a8	AMDGPU: Correctly expand f64 sqrt intrinsic rocm-device-libs and llpc were avoiding using f64 sqrt intrinsics in favor of their own expansions. Port the expansion into the backend. Both of these users should be updated to call the intrinsic instead. The library and llpc expansions are slightly different. llpc uses an ldexp to do the scale; the library uses a multiply. Use ldexp to do the scale instead of the multiply. I believe v_ldexp_f64 and v_mul_f64 are always the same number of cycles, but it's cheaper to materialize the 32-bit integer constant than the 64-bit double constant. The libraries have another fast version of sqrt which will be handled separately. I am tempted to do this in an IR expansion instead. In the IR we could take advantage of computeKnownFPClass to avoid the 0-or-inf argument check.	2023-07-25 07:54:11 -04:00
Matt Arsenault	47b3ada432	AMDGPU: Add more sqrt f64 lowering tests Almost all permutations of the flags are potentially relevant.	2023-07-25 07:54:11 -04:00
pvanhout	3cd4afce5b	[AMDGPU] Allow vector access types in PromoteAllocaToVector Depends on D152706 Solves SWDEV-408279 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155699	2023-07-25 07:44:48 +02:00
pvanhout	3890a3b113	[AMDGPU] Use SSAUpdater in PromoteAlloca This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly. Note PromoteAlloca is still reliant on SROA running first to canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D152706	2023-07-25 07:44:47 +02:00
Matt Arsenault	0d797b71eb	RegisterCoalescer: Fix empty subrange verifier error In this example an implicit def had live-out undef subrange defs. After coalescing with the def from a previous block, the undef-defed lanes are no longer live out of the block in the new interval. An empty subrange was tentatively created for these lanes, but it must be deleted.	2023-07-24 12:18:34 -04:00
Matt Arsenault	2a53b6c06b	RegisterCoalescer: Fix verifier error on redef of subregister for live out implicit_defs A live out implicit_def wasn't deleted, but the subranges weren't correctly updated. The main range was correct but the def corresponding to the initial main range def instruction was missing from the lanes redefined in another block. The written lanes are not quite the same as the valid lanes in the case of an implicit_def. Fixes verifier error in blender. There is an additional verifier error in some of the testcase variants where an empty subrange remains.	2023-07-24 12:18:34 -04:00
Matt Arsenault	e561e7cb48	AMDGPU: Implement combineRepeatedFPDivisors	2023-07-24 11:19:36 -04:00
Pravin Jagtap	d163b76ce3	[AMDGPU] Fix llvm.amdgcn.wave.reduce.umax/umin MIR tests Fixes the MIR tests reported in https://lab.llvm.org/buildbot/#/builders/16/builds/51955 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D156125	2023-07-24 10:19:37 -04:00
Pravin Jagtap	c48ed93cf8	[AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic. When input to intrinsic is uniform value, reduced value is same as input whereas if input value is divergent we need to iterate over all active lanes of WaveFront to perform the reduction. The control flow for a `loop` has been set up, which iterates over `only` active lanes to perform reduction. Introduced WAVE_REDUCE_UMIN_PSEUDO_U32 and WAVE_REDUCE_UMAX_PSEUDO_U32 Pseudos which are lowered Post-ISel (in `EmitInstrWithCustomInserter `). Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D154858	2023-07-24 00:06:00 -04:00
Matt Arsenault	8406c3568a	AMDGPU: Implement new 2ulp fdiv lowering Extends the new frexp scaled reciprocal to the general case. The reciprocal case is just the same thing when frexp of 1 is constant folded. Could probably clean up the code to rely on that constant folding. Improves results for the IEEE path for the default OpenCL division. We used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy threshold with DAZ, which uses explicit range checks. This gives us a better fast option with the default IEEE behavior.	2023-07-21 18:55:42 -04:00
Matt Arsenault	6699c37028	AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling NFC-ish. Does trigger some reordering of the fdiv scalarization. Also skips scalarizing in more cases where nothing was going to happen. We can still scalarize in some no-op edge cases. https://reviews.llvm.org/D155740	2023-07-21 18:55:42 -04:00
Matt Arsenault	8287f3af9d	AMDGPU: Overhaul and improve rcp and rsq f32 formation The highlight change is a new denormal safe 1ulp lowering which uses rcp after using frexp to perform input scaling. This saves 2 instructions compared to other implementations which performed an explicit denormal range change. This improves the OpenCL default, and requires a flag for HIP. I don't believe there's any flag wired up for OpenMP to emit the necessary fpmath metadata. This provides several improvements and changes that were hard to separate without regressing one case or another. Disturbingly the OpenCL conformance test seems to have the reciprocal test commented out. I locally hacked it back in to test this. Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like the rcp case, we could do this in codegen if !fpmath were preserved (although we would lose some computeKnownFPClass tricks). Start requiring contract flags to form rsq. The rsq fusion actually improves the result from ~2ulp to ~1ulp. We have some older fusion in codegen which only keys off unsafe math which should be refined. Expand rsq patterns by checking for denormal inputs and pre/post multiplying like the current library code does. We also take advantage of computeKnownFPClass to avoid the scaling when we can statically prove the input cannot be a denormal. We could do the same for the rcp case, but unlike rsq a large input can underflow to denormal. We need additional upper bound exponent checks on the input in order to do the same for rcp. This rsq handling also now starts handling the negated case. We introduce rsq with an fneg. In the case the fneg doesn't fold into its user, it's a neutral change but provides improvement if it is foldable as a source modifier. Also starts respecting the arcp attribute properly, and more strictly interprets afn. We were previously interpreting afn as implying you could do the reciprocal expansion of an fdiv. The codegen handling of these also needs to be revisited. This also effectively introduces the optimization combineRepeatedFPDivisors enables, just done in the IR instead (and only for f32). This is almost across the board better. The one minor regression is for gfx6/buggy frexp case where for multiple reciprocals, we could previously reuse rematerialized constants per instance (it's neutral for a single rcp). The fdiv.fast and sqrt handling need to be revisited next. https://reviews.llvm.org/D155593	2023-07-21 16:35:53 -04:00
Matt Arsenault	37512d7629	AMDGPU: Add baseline test for fdiv combine	2023-07-21 16:04:12 -04:00
Jay Foad	e45a0c2994	[AMDGPU][RFC] Update isLegalAddressingMode for GFX9 SMEM signed offsets Differential Revision: https://reviews.llvm.org/D155587	2023-07-21 10:56:43 +01:00
Jay Foad	787bef0bee	[AMDGPU] Add tests for SMEM addressing modes in CodeGenPrepare Differential Revision: https://reviews.llvm.org/D155854	2023-07-21 10:56:43 +01:00
Matt Arsenault	d33ab05467	AMDGPU: Add flag to disable fdiv processing in IR pass We kind of have to have multiple implementations of fdiv split between the two selectors with some pre-processing. Add yet another test to check for consistency of interpretation of flag combinations. We have quite a bit of test redundancy here already, but there are so many possible interesting permutations it's unwieldy to cover every detail in any one of them. We have a number of overlapping fdiv tests but it's hard to follow everything going on as it is.	2023-07-20 19:51:15 -04:00
Matt Arsenault	b2d58b596c	AMDGPU: Expand rsq testing to cover contract flag The 1.0/sqrt(x) -> rsq(x) fold increases precision and probably needs a contract flag.	2023-07-20 19:51:15 -04:00
Matt Arsenault	fb54afd1b7	AMDGPU: Fold fsub [+-0] into fneg when folding source modifiers This isn't always folded to fneg for a freestanding fsub depending on the denormal mode. When matching source modifiers, we're implicitly canonicalizing the input so we can fold it here. Doesn't bother handling the VOP3P case since it's only relevant with DAZ, which nobody really uses with f16. For f64, tests show an existing bug where DAGCombiner tries to respect the denormal mode for fsub -0, x, but not after it's lowered to fadd -0, (fneg x). Either the fold is wrong or we shouldn't restrict the fsub case based on the denormal mode. https://reviews.llvm.org/D155652	2023-07-20 19:29:40 -04:00
Matt Arsenault	881e9f2934	AMDGPU: Regenerate test checks Mostly a workaround for recent reverts in update_test_checks	2023-07-20 19:26:35 -04:00
Matt Arsenault	ca34f1bdcd	AMDGPU: Add baseline test for folding fsub into fneg modifiers	2023-07-20 18:29:35 -04:00
Matt Arsenault	0295513238	AMDGPU: Filter out contract flags when lowering exp It is unsafe to contract the fsub into the fmul. It also increases code size by duplicating a constant.	2023-07-20 18:14:24 -04:00
Matt Arsenault	076bc374fc	AMDGPU: Add some new baseline tests for exp lowering	2023-07-20 18:14:24 -04:00
Jingu Kang	351b4c17dd	Revert "[MachineLICM] Handle Subloops" This reverts commit 50dd383d08670960540fecb4b48c0f0429fbfba3.	2023-07-20 17:12:25 +01:00
Jingu Kang	50dd383d08	[MachineLICM] Handle Subloops Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass handle subloops with only visiting outmost loop's blocks once. Differential Revision: https://reviews.llvm.org/D154205	2023-07-20 16:39:13 +01:00
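
The frexp-scaled reciprocal described in 8287f3af9d, and its extension to general division in 8406c3568a, can be sketched numerically as follows. This is only an illustration of the scaling idea, not the backend code: the function names `scaled_rcp` and `scaled_fdiv` are hypothetical, Python's correctly rounded `1.0 / m` stands in for the hardware rcp instruction, the arithmetic is f64 rather than f32, and only finite nonzero inputs are handled.

```python
import math


def scaled_rcp(x: float) -> float:
    """Sketch of 1/x via frexp scaling (hypothetical helper, finite nonzero x only)."""
    # Split x into mantissa and exponent: x == m * 2**e with 0.5 <= |m| < 1.
    # The mantissa is in the normal range even when x itself is a denormal.
    m, e = math.frexp(x)
    # Stand-in for the hardware rcp, applied to a value that is never denormal.
    r = 1.0 / m
    # Undo the scaling on the result: 1/x == (1/m) * 2**(-e).
    return math.ldexp(r, -e)


def scaled_fdiv(a: float, b: float) -> float:
    """Sketch of a/b using the same scaling on both operands (hypothetical helper)."""
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    # Multiply the numerator mantissa by the reciprocal of the denominator mantissa,
    # then apply the exponent difference in a single ldexp.
    q = ma * (1.0 / mb)
    return math.ldexp(q, ea - eb)


if __name__ == "__main__":
    print(scaled_rcp(8.0))
    print(scaled_fdiv(3.0, 4.0))
```

With a = 1.0, frexp yields a constant mantissa and exponent, so `scaled_fdiv(1.0, x)` reduces to the same computation as `scaled_rcp(x)`, which matches the remark in 8406c3568a that the reciprocal is the general case with the frexp of 1 constant folded.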

6628 Commits
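
The lane loop described in c48ed93cf8 for llvm.amdgcn.wave.reduce.umin/umax can be modelled roughly as below. This is a host-side simulation under stated assumptions, not the emitted machine code: the names `wave_reduce`, `wave_reduce_umin`, and `wave_reduce_umax` are hypothetical, the wave is a plain list of per-lane values, the exec mask is an integer bitmask, and the identity values used when no lane is active are an assumption of this sketch.

```python
UINT32_MAX = 0xFFFFFFFF


def wave_reduce(values, exec_mask, op, identity):
    """Reduce values over the lanes whose bit is set in exec_mask (hypothetical model)."""
    acc = identity
    remaining = exec_mask
    while remaining:
        # Index of the lowest set bit, i.e. the next active lane to visit.
        lane = (remaining & -remaining).bit_length() - 1
        acc = op(acc, values[lane])
        # Clear that bit so each active lane is visited exactly once.
        remaining &= remaining - 1
    return acc


def wave_reduce_umin(values, exec_mask):
    return wave_reduce(values, exec_mask, min, UINT32_MAX)


def wave_reduce_umax(values, exec_mask):
    return wave_reduce(values, exec_mask, max, 0)


if __name__ == "__main__":
    lanes = [5, 3, 9, 1]
    # Only lanes 0, 1 and 2 are active; lane 3 is skipped.
    print(wave_reduce_umin(lanes, 0b0111))
    print(wave_reduce_umax(lanes, 0b0111))
```

When every active lane holds the same value, the loop returns that value unchanged, which is why the commit can skip the loop entirely for uniform inputs.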