llvm-project

Author	SHA1	Message	Date
Jun Wang	c4e517f59c	[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035 ) A new function attribute named amdgpu_num_work_groups is added. This attribute, which consists of three integers, allows programmers to let the compiler know the number of workgroups to be launched in each of the three dimensions and do optimizations based on that information. --------- Co-authored-by: Jun Wang <jun.wang7@amd.com>	2024-03-12 10:30:39 -07:00
Matt Arsenault	bd72ebd8d1	AMDGPU: Add some more mfma hazard recognizer tests (#84727 )	2024-03-12 22:05:47 +05:30
Jake Egan	fa1d13590c	[AIX][tests] Disable failing tests on AIX These new tests are failing on the AIX bot because the -I option isn't supported. Disable these tests for now until they can be fixed.	2024-03-12 12:11:18 -04:00
Pierre van Houtryve	d4569d42b5	[AMDGPU] Let LowerModuleLDS run twice on the same module (#81729 ) If all variables in the module are absolute, this means we're running the pass again on an already lowered module, and that works. If none of them are absolute, lowering can proceed as usual. Only diagnose cases where we have a mix of absolute/non-absolute GVs, which means we added LDS GVs after lowering, which is broken. See #81491 Split from #75333	2024-03-11 09:20:01 +01:00
AtariDreams	4e0e9b17c6	[SelectionDAG] Switch to LiveRegUnits (#84197 )	2024-03-11 12:47:39 +05:30
Carl Ritson	4a21e3afa2	[LiveIntervals] repairIntervalsInRange: recompute width changes (#78564 ) Extend repairIntervalsInRange to completely recompute the interval for a register if subregister defs exist without precise subrange matches (LaneMask exactly matching subregister). This occurs when register sequences are lowered to copies such that the sizes of the copies do not match any uses of the subregisters formed (i.e. during twoaddressinstruction). The subranges without this change are probably legal, but do not match those generated by live interval computation. This creates problems with other code that assumes subranges precisely cover all subregisters defined, e.g. shrinkToUses().	2024-03-11 15:24:17 +09:00
Carl Ritson	d9e6aa7048	[AMDGPU] Update LiveInterval def index for early-clobber (#79285 ) On converting an instruction to an early-clobber definition in convertToThreeAddress, we must also update live intervals for the register to start at the early-clobber index.	2024-03-11 14:54:11 +09:00
Jay Foad	fd3eaf76ba	[GISel] Enforce G_PTR_ADD RHS type matching index size for addr space (#84352 )	2024-03-09 09:07:22 +00:00
Shilei Tian	e963d0740e	[AMDGPU] Replace `isInlinableLiteral16` with specific version (#84402 ) The current implementation of `isInlinableLiteral16` assumes a 16-bit inlinable literal is either an `i16` or a `fp16`. This is not always true because of `bf16`. However, we can't tell `fp16` and `bf16` apart by just looking at the value. This patch splits `isInlinableLiteral16` into three versions, `i16`, `fp16`, `bf16` respectively, and calls the corresponding version.	2024-03-08 14:49:52 -05:00
Pierre van Houtryve	4b1910b11d	[GlobalISel][AMDGPU] Import patterns with multiple defs (#84171 ) Fixes #63216	2024-03-08 09:39:10 +01:00
Fangrui Song	66bd3cd75b	[AMDGPU,test] Change llc -march= to -mtriple= PR #75982 had been created before these tests were added, therefore some tests were not updated.	2024-03-07 19:09:18 -08:00
David Green	44be5a7fdc	[Codegen] Make Width in getMemOperandsWithOffsetWidth a LocationSize. (#83875 ) This is another part of #70452 which makes getMemOperandsWithOffsetWidth use a LocationSize for Width, as opposed to the unsigned it currently uses. The advantages on its own are not super high if getMemOperandsWithOffsetWidth usually uses known sizes, but if the values can come from an MMO it can help be more accurate in case they are Unknown (and in the future, scalable).	2024-03-06 17:40:13 +00:00
Krzysztof Drewniak	6540f1635a	[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952 ) This commit adds the -lower-buffer-fat-pointers pass, which is applicable to all AMDGCN compilations. The purpose of this pass is to remove the type `ptr addrspace(7)` from incoming IR. This must be done at the LLVM IR level because `ptr addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled by SelectionDAG. The detailed operation of the pass is described in comments, but, in summary, the removal proceeds by: 1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of i160 (including vectors and aggregates). This is needed because the in-register representation of these pointers will stop matching their in-memory representation in step 2, and so ptrtoint/inttoptr operations are used to preserve the expected memory layout 2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two parts of a buffer fat pointer (the 128-bit address space 8 resource and the 32-bit address space 6 offset) visible in the IR. This also impacts the argument and return types of functions. 3. Splitting the resource and offset parts. All instructions that produce or consume buffer fat pointers (like GEP or load) are rewritten to produce or consume the resource and offset parts separately. For example, GEP updates the offset part of the result and a load uses the resource and offset parts to populate the relevant llvm.amdgcn.raw.ptr.buffer.load intrinsic call. At the end of this process, the original mutated instructions are replaced by their new split counterparts, ensuring no invalidly-typed IR escapes this pass. (For operations like call, where the struct form is needed, insertelement operations are inserted). 
Compared to LGC's PatchBufferOp ( `32cda89776/lgc/patch/PatchBufferOp.cpp` ): this pass - Also handles vectors of ptr addrspace(7)s - Also handles function boundaries - Includes the same uniform buffer optimization for loops and conditionals - Does not handle memcpy() and friends (this is future work) - Does not break up large loads and stores into smaller parts. This should be handled by extending the legalization of .buffer.{load,store} to handle larger types by producing multiple instructions (the same way ordinary LOAD and STORE are legalized). That work is planned for a followup commit. - Does not have special logic for handling divergent buffer descriptors. The logic in LGC is, as far as I can tell, incorrect in general, and, per discussions with @nhaehnle, isn't widely used. Therefore, divergent descriptors are handled with waterfall loops later in legalization. As a final matter, this commit updates atomic expansion to treat buffer operations analogously to global ones. (One question for reviewers: is the new pass in the right place? Should it be later in the pipeline?) Differential Revision: https://reviews.llvm.org/D158463	2024-03-06 09:49:58 -06:00
Mirko Brkušanin	1fd1f4c0e1	[AMDGPU] Handle amdgpu.last.use metadata (#83816 ) Convert !amdgpu.last.use metadata into MachineMemOperand for last use and handle it in SIMemoryLegalizer similar to nontemporal and volatile.	2024-03-06 16:33:52 +01:00
Emma Pilkington	4490003a22	[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905 ) The previous name 'amdgpu_code_object_version', was misleading since this is really a property of the HSA OS. The new spelling also matches the asm directive I added in bc82cfb.	2024-03-06 09:51:48 -05:00
Joseph Huber	1fc5e50ceb	[AMDGPU] Implement 'llvm.get.fpenv' and 'llvm.set.fpenv' (#83906 ) Summary: This patch implements the LLVM floating point environment control intrinsics and also exposes it through clang. We encode the floating point environment as a 64-bit value that simply concatenates the values of the mode registers and the current trap status. We only fetch the bits relevant for floating point instructions. That is, rounding mode, denormalization mode, ieee, dx10 clamp, debug, enabled traps, f16 overflow, and active exceptions.	2024-03-06 08:11:54 -06:00
Shilei Tian	e9c1dbb408	Revert "[AMDGPU] Replace `isInlinableLiteral16` with specific version (#81345 )" This reverts commit 530f0e64ec11327879c44f2fd55c7c28efdbaa2d because it breaks downstream.	2024-03-06 08:42:54 -05:00
Pierre van Houtryve	52d5b8e02d	[AMDGPU] Don't form sext/abs/neg fp8 cvt (#83843 ) gfx940 does not allow abs/sext/neg on v_cvt_fp8/bf8 & pk variants. Fixes SWDEV-447468	2024-03-06 10:38:20 +01:00
Sameer Sahasrabuddhe	60822637bf	Restore "Implement convergence control in MIR using SelectionDAG (#71785 )" This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9. Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306. LLVM function calls carry convergence control tokens as operand bundles, where the tokens themselves are produced by convergence control intrinsics. This patch implements convergence control tokens in MIR as follows: 1. Introduce target-independent ISD opcodes and MIR opcodes for convergence control intrinsics. 2. Model token values as untyped virtual registers in MIR. The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a corresponding machine opcode with the same spelling. This glues the convergence control token to SDNodes that represent calls to intrinsics. The glued token is later translated to an implicit argument in the MIR. The lowering of calls to user-defined functions is target-specific. On AMDGPU, the convergence control operand bundle at a non-intrinsic call is translated to an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment converts this explicit argument to an implicit argument on the SI_CALL instruction.	2024-03-06 12:19:32 +05:30
Noah Goldstein	17162b61c2	[KnownBits] Make `nuw` and `nsw` support in `computeForAddSub` optimal Just some improvements that should hopefully strengthen analysis. Closes #83580	2024-03-05 12:59:58 -06:00
bcahoon	4cf8b298cf	[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597 ) The promote alloca to vector transformation assumes that the vector index is a constant value. If it is not a constant, then either an assert occurs or the transformation generates an incorrect index.	2024-03-05 08:18:17 -06:00
Mitch Phillips	f010b1bef4	Revert "Restore "Implement convergence control in MIR using SelectionDAG (#71785 )"" This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9. Reason: Broke the sanitizer buildbots. See the comments at https://github.com/llvm/llvm-project/pull/71785 for more information.	2024-03-04 17:05:34 +01:00
Mirko Brkušanin	27ce5121ee	[AMDGPU] Fix setting nontemporal in memory legalizer (#83815 ) Iterator MI can advance in insertWait() but we need original instruction to set temporal hint. Just move it before handling volatile.	2024-03-04 15:05:31 +01:00
Shilei Tian	530f0e64ec	[AMDGPU] Replace `isInlinableLiteral16` with specific version (#81345 )	2024-03-04 08:40:42 -05:00
Mirko Brkušanin	982e9022ca	[AMDGPU] Add GFX12 memory legalizer tests (#83814 )	2024-03-04 11:22:04 +01:00
Sameer Sahasrabuddhe	c7fdd8c11e	Restore "Implement convergence control in MIR using SelectionDAG (#71785 )" Original commit 79889734b940356ab3381423c93ae06f22e772c9. Previously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e. LLVM function calls carry convergence control tokens as operand bundles, where the tokens themselves are produced by convergence control intrinsics. This patch implements convergence control tokens in MIR as follows: 1. Introduce target-independent ISD opcodes and MIR opcodes for convergence control intrinsics. 2. Model token values as untyped virtual registers in MIR. The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a corresponding machine opcode with the same spelling. This glues the convergence control token to SDNodes that represent calls to intrinsics. The glued token is later translated to an implicit argument in the MIR. The lowering of calls to user-defined functions is target-specific. On AMDGPU, the convergence control operand bundle at a non-intrinsic call is translated to an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment converts this explicit argument to an implicit argument on the SI_CALL instruction.	2024-03-04 13:28:04 +05:30
Bjorn Pettersson	da591d390e	[GlobalISel][TableGen] Take first result for multi-output instructions (#81130 ) Previously, tblgen would reject patterns where one of its nested instructions produced more than one result. These arise when the instruction definition contains 'outs' as well as 'Defs'. This patch fixes that by always taking the first result, which is how these situations are handled in SelectionDAG. Original patch: https://reviews.llvm.org/D86617 Continued as: https://github.com/llvm/llvm-project/pull/81130	2024-03-02 20:10:02 +01:00
Pierre van Houtryve	756166e342	[AMDGPU] Improve detection of non-null addrspacecast operands (#82311 ) Use IR analysis to infer when an addrspacecast operand is nonnull, then lower it to an intrinsic that the DAG can use to skip the null check. I did this using an intrinsic as it's non-intrusive. An alternative would have been to allow something like `!nonnull` on `addrspacecast` then lower that to a custom opcode (or add an operand to the addrspacecast MIR/DAG opcodes), but it's a lot of boilerplate for just one target's use case IMO. I'm hoping that when we switch to GISel that we can move all this logic to the MIR level without losing info, but currently the DAG doesn't see enough so we need to act in CGP. Fixes: SWDEV-316445	2024-03-01 14:01:10 +01:00
Nick Anderson	ba8e9ace13	[AMDGPU] promote i1 arg type for amdgpu_cs (#82971 ) fixes #68087 Not sure where to put regression tests for this pr? Also, should i1 args not in reg also be promoted?	2024-03-01 14:25:46 +05:30
Leon Clark	5b07fd4799	[AMDGPU] Fix OpenCL conformance test failures for ctlz. (#83170 ) Remove LSH transform and restore previous lowering. Fixes conformance issue in [77615](https://github.com/llvm/llvm-project/pull/77615) where OpenCL integer_ops tests fail for integer_clz. Co-authored-by: Leon Clark <leoclark@amd.com>	2024-02-29 22:28:13 +00:00
Petar Avramovic	0d572c41f9	AMDGPU\GlobalISel: remove amdgpu-global-isel-risky-select flag (#83426 ) AMDGPUInstructionSelector should no longer attempt to select S1 G_PHIs. Remove MIR test that attempts to inst-select divergent vcc(S1) G_PHI. Lane mask merging algorithm for GlobalISel is now responsible for selecting divergent S1 G_PHIs in AMDGPUGlobalISelDivergenceLowering. Uniform S1 G_PHIs should be lowered to S32 G_PHIs in reg bank select pass. In summary S1 G_PHIs should not reach AMDGPUInstructionSelector.	2024-02-29 15:38:54 +01:00
Petar Avramovic	6c2eec5cea	AMDGPU/GlobalISel: lane masks merging (#73337 ) Basic implementation of lane mask merging for GlobalISel. Lane masks on GlobalISel are registers with sgpr register class and S1 LLT - required by machine uniformity analysis. Implements equivalent of lowerPhis from SILowerI1Copies.cpp in: patch 1: https://github.com/llvm/llvm-project/pull/75340 patch 2: https://github.com/llvm/llvm-project/pull/75349 patch 3: https://github.com/llvm/llvm-project/pull/80003 patch 4: https://github.com/llvm/llvm-project/pull/78431 patch 5: is in this commit: AMDGPU/GlobalISelDivergenceLowering: constrain incoming registers Previously, in PHIs that represent lane masks, incoming registers taken as-is were not selected as lane masks. Such registers are not being merged with another lane mask and most often only have S1 LLT. Implement constrainAsLaneMask by constraining incoming registers taken as-is with lane mask attributes, essentially transforming them to lane masks. This is final step in having PHI instructions created in this pass to be fully instruction-selected.	2024-02-29 13:57:59 +01:00
Matt Arsenault	6cfd3439d4	APFloat: Fix signed zero handling in minnum/maxnum (#83376 ) Follow the 2019 rules and order -0 as less than +0 and +0 as greater than -0. As currently defined this isn't required for the intrinsics, but is a better QoI. This will avoid the workaround in libc added by #83158	2024-02-29 16:51:33 +05:30
Shilei Tian	191fd2d9db	[NFC][AMDGPU] Move the rem tests in `div_i128.ll` into `rem_i128.ll` (#83307 )	2024-02-28 18:47:02 -05:00
Petar Avramovic	3e35ba53e2	AMDGPU/GFX12: Insert waitcnts before stores with scope_sys (#82996 ) Insert waitcnts for loads and atomics before stores with system scope. Scope is field in instruction encoding and corresponds to desired coherence level in cache hierarchy. Intrinsic stores can set scope in cache policy operand. If volatile keyword is used on generic stores memory legalizer will set scope to system. Generic stores, by default, get lowest scope level. Waitcnts are not required if it is guaranteed that memory is cached. For example vulkan shaders can guarantee this. TODO: implement flag for frontends to give us a hint not to insert waits. Expecting vulkan flag to be implemented as vulkan:private MMRA.	2024-02-28 16:18:04 +01:00
Valery Pykhtin	a845ea3878	[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. (#82406 ) This fixes crash when operand sources for V_OR instruction reside in different basic blocks.	2024-02-28 14:47:33 +01:00
Jeffrey Byrnes	cf1c97b2d2	[AMDGPU] Do not attempt to fallback to default mutations (#83208 ) IGLP itself will be in SavedMutations via mutations added during Scheduler creation, thus falling back results in reapplying IGLP. In PostRA scheduling, if we have multiple regions with IGLP instructions, then we may have infinite loop. Disable the feature for now.	2024-02-27 18:04:59 -08:00
choikwa	04db60d150	[AMDGPU] Prevent hang in SIFoldOperands by caching uses (#82099 ) foldOperands() for REG_SEQUENCE has recursion that can trigger an infinite loop as the method can modify the operand order, which messes up the range-based for loop. This patch fixes the issue by caching the uses for processing beforehand, and then iterating over the cache rather than using the instruction iterator.	2024-02-27 09:13:59 -06:00
Matt Arsenault	ca66f7469f	AMDGPU: Merge tests for llvm.amdgcn.dispatch.id	2024-02-27 18:42:40 +05:30
Matt Arsenault	2e4643a53e	AMDGPU: Regenerate baseline test checks	2024-02-27 18:42:40 +05:30
michaelselehov	56ad6d1939	[MachineLICM] Hoist COPY instruction only when user can be hoisted (#81735 ) befa925acac8fd6a9266e introduced preliminary hoisting of COPY instructions when the user of the COPY is inside the same loop. That optimization appeared to be too aggressive and hoisted too many COPY's greatly increasing register pressure causing performance regressions for AMDGPU target. This is intended to fix the regression by hoisting COPY instruction only if either: - User of COPY can be hoisted (other args are invariant) or - Hoisting COPY doesn't bring high register pressure	2024-02-27 12:31:29 +00:00
Matt Arsenault	e7900e695e	AMDGPU: Regenerate baseline mir tests	2024-02-27 10:44:53 +05:30
Noah Goldstein	15a7de697a	[SelectionDAG] Support sign tracking through `{S\|U}INT_TO_FP` Just a minimal amount of easily provable tracking. Proofs: https://alive2.llvm.org/ce/z/RQYbdw Closes #82808 Alive2 has an issue with `(sitofp i1)`, but it can be verified by hand: https://godbolt.org/z/qKr7hT7s9	2024-02-26 15:35:38 -06:00
Jeffrey Byrnes	113052b2b0	[AMDGPU] Prefer lower total register usage in regions with spilling Change-Id: Ia5c434b0945bdcbc357c5e06c3164118fc91df25	2024-02-26 12:19:52 -08:00
Petar Avramovic	433f8e741e	MachineSSAUpdater: use all vreg attributes instead of reg class only (#78431 ) When initializing MachineSSAUpdater, save all attributes of the current virtual register and create new virtual registers with the same attributes. New virtual registers now have both the same register class or bank and the same LLT. Previously, new virtual registers had the same register class but the LLT was not set (LLT was set to default/empty LLT). Required by GlobalISel for AMDGPU: new 'lane mask' virtual registers created by MachineSSAUpdater need to have both register class and LLT. patch 4 from: https://github.com/llvm/llvm-project/pull/73337	2024-02-26 13:46:13 +01:00
Jack Styles	28233408a2	[CodeGen] [ARM] Make RISC-V Init Undef Pass Target Independent and add support for the ARM Architecture. (#77770 ) When using Greedy Register Allocation, there are times where early-clobber values are ignored, and assigned the same register. This is illegal behaviour for these instructions. To get around this, using Pseudo instructions for early-clobber registers gives them a definition and allows Greedy to assign them to a different register. This then meets the ARM Architecture Reference Manual and matches the defined behaviour. This patch takes the existing RISC-V patch and makes it target independent, then adds support for the ARM Architecture. Doing this will ensure early-clobber constraints are followed when using the ARM Architecture. Making the pass target independent will also open up the possibility that support for other architectures can be added in the future.	2024-02-26 12:12:31 +00:00
Rishabh Bali	fe42e72db2	[CodeGen] Port AtomicExpand to new Pass Manager (#71220 ) Port the `atomicexpand` pass to the new Pass Manager. Fixes #64559	2024-02-25 18:42:22 +05:30
Jeffrey Byrnes	8f2bd8ae68	[AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (#81342 ) This implements the basic pipelining structure of exp/mfma interleaving for better extensibility. While it does have improved extensibility, there are controls which only enable it for DAGs with certain characteristics (matching the DAGs it has been designed against).	2024-02-23 17:13:20 -08:00
Pierre van Houtryve	4235e44d4c	[GlobalISel] Constant-fold G_PTR_ADD with different type sizes (#81473 ) All other opcodes in the list are constrained to have the same type on both operands, but not G_PTR_ADD. Fixes #81464	2024-02-22 13:15:26 +01:00
Nick Anderson	8bd327d6fe	[AMDGPU][GlobalISel] Add fdiv / sqrt to rsq combine (#78673 ) Fixes #64743	2024-02-22 09:47:36 +01:00
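
The signed-zero ordering described in commit 6cfd3439d4 (APFloat minnum/maxnum) can be illustrated with a small standalone sketch. This is not the APFloat implementation: `minnumSignedZero` and `maxnumSignedZero` are hypothetical helper names operating on plain `double`, and the NaN handling (return the other operand) is an assumption made to keep the example short. The only point shown is that -0 is ordered below +0 when both operands are zero.

```cpp
#include <cmath>

// Hypothetical helper (not the APFloat API): minimum of two doubles,
// treating -0 as less than +0. A NaN operand yields the other operand.
double minnumSignedZero(double a, double b) {
  if (std::isnan(a)) return b;
  if (std::isnan(b)) return a;
  // Both operands are zeros (of either sign): prefer the negative one.
  if (a == 0.0 && b == 0.0)
    return std::signbit(a) ? a : b;
  return a < b ? a : b;
}

// Hypothetical helper (not the APFloat API): maximum of two doubles,
// treating +0 as greater than -0. A NaN operand yields the other operand.
double maxnumSignedZero(double a, double b) {
  if (std::isnan(a)) return b;
  if (std::isnan(b)) return a;
  // Both operands are zeros (of either sign): prefer the positive one.
  if (a == 0.0 && b == 0.0)
    return std::signbit(a) ? b : a;
  return a < b ? b : a;
}
```

A plain `a < b ? a : b` cannot tell the two zeros apart because `-0.0 == 0.0` compares equal, which is why the sketch inspects the sign bit explicitly.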

7240 Commits