7240 Commits

Author SHA1 Message Date
Jun Wang
c4e517f59c
[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-03-12 10:30:39 -07:00
Matt Arsenault
bd72ebd8d1
AMDGPU: Add some more mfma hazard recognizer tests (#84727) 2024-03-12 22:05:47 +05:30
Jake Egan
fa1d13590c [AIX][tests] Disable failing tests on AIX
These new tests are failing on the AIX bot because the -I option isn't supported.

Disable these tests for now until they can be fixed.
2024-03-12 12:11:18 -04:00
Pierre van Houtryve
d4569d42b5
[AMDGPU] Let LowerModuleLDS run twice on the same module (#81729)
If all variables in the module are absolute, this means we're running
the pass again on an already lowered module, and that works.
If none of them are absolute, lowering can proceed as usual.
Only diagnose cases where we have a mix of absolute/non-absolute GVs,
which means we added LDS GVs after lowering, which is broken.
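
The three cases above can be sketched as a small classifier. This is an illustrative model only, not the pass's code: the names `LdsState` and `classify_lds_globals` are hypothetical, and each LDS global is reduced to a single "is absolute" flag.

```python
from enum import Enum


class LdsState(Enum):
    ALREADY_LOWERED = "already lowered"   # every LDS global is absolute
    NEEDS_LOWERING = "needs lowering"     # no LDS global is absolute
    MIXED = "mixed"                       # the only case that is diagnosed


def classify_lds_globals(is_absolute):
    """Classify a module given one boolean per LDS global variable."""
    flags = list(is_absolute)
    if flags and all(flags):
        return LdsState.ALREADY_LOWERED
    if not any(flags):
        # Also covers a module with no LDS globals at all.
        return LdsState.NEEDS_LOWERING
    return LdsState.MIXED
```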

See #81491
Split from #75333
2024-03-11 09:20:01 +01:00
AtariDreams
4e0e9b17c6
[SelectionDAG] Switch to LiveRegUnits (#84197) 2024-03-11 12:47:39 +05:30
Carl Ritson
4a21e3afa2
[LiveIntervals] repairIntervalsInRange: recompute width changes (#78564)
Extend repairIntervalsInRange to completely recompute the interval for a
register if subregister defs exist without precise subrange matches
(LaneMask exactly matching subregister).
This occurs when register sequences are lowered to copies such that the
sizes of the copies do not match any uses of the subregisters formed
(i.e. during twoaddressinstruction).

The subranges without this change are probably legal, but do not match
those generated by live interval computation. This creates problems with
other code that assumes subranges precisely cover all subregisters
defined, e.g. shrinkToUses().
2024-03-11 15:24:17 +09:00
Carl Ritson
d9e6aa7048
[AMDGPU] Update LiveInterval def index for early-clobber (#79285)
On converting an instruction to an early-clobber definition in
convertToThreeAddress, we must also update live intervals for the
register to start at the early-clobber index.
2024-03-11 14:54:11 +09:00
Jay Foad
fd3eaf76ba
[GISel] Enforce G_PTR_ADD RHS type matching index size for addr space (#84352) 2024-03-09 09:07:22 +00:00
Shilei Tian
e963d0740e
[AMDGPU] Replace isInlinableLiteral16 with specific version (#84402)
The current implementation of `isInlinableLiteral16` assumes that a 16-bit
inlinable literal is either an `i16` or a `fp16`. This is not always true
because of `bf16`. However, we can't tell `fp16` and `bf16` apart by just
looking at the value. This patch splits `isInlinableLiteral16` into three
versions, `i16`, `fp16`, and `bf16` respectively, and calls the corresponding
version.
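
To illustrate why the bit pattern alone is ambiguous, here is a minimal sketch in Python (not LLVM code; the helper names are made up for this example). It decodes the same 16 bits as IEEE half precision and as bf16, treating bf16 as the upper half of an IEEE float32.

```python
import struct


def decode_fp16(bits):
    """Interpret a 16-bit pattern as IEEE half precision."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def decode_bf16(bits):
    """Interpret a 16-bit pattern as bf16 (upper 16 bits of a float32)."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def encode_fp16(value):
    """Bit pattern of a value that is exactly representable in fp16."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def encode_bf16(value):
    """Bit pattern of a value that is exactly representable in bf16."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16
```

The value 1.0 encodes as 0x3C00 in fp16 but as 0x3F80 in bf16, and 0x3C00 read as bf16 is 2^-7, so a check for "is this an inline constant" needs to know which type it is looking at.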
2024-03-08 14:49:52 -05:00
Pierre van Houtryve
4b1910b11d
[GlobalISel][AMDGPU] Import patterns with multiple defs (#84171)
Fixes #63216
2024-03-08 09:39:10 +01:00
Fangrui Song
66bd3cd75b [AMDGPU,test] Change llc -march= to -mtriple=
PR #75982 had been created before these tests were added, therefore
some tests were not updated.
2024-03-07 19:09:18 -08:00
David Green
44be5a7fdc
[Codegen] Make Width in getMemOperandsWithOffsetWidth a LocationSize. (#83875)
This is another part of #70452 which makes getMemOperandsWithOffsetWidth
use a LocationSize for Width, as opposed to the unsigned it currently
uses. The advantages on its own are not super high if
getMemOperandsWithOffsetWidth usually uses known sizes, but if the
values can come from an MMO it can help be more accurate in case they
are Unknown (and in the future, scalable).
2024-03-06 17:40:13 +00:00
Krzysztof Drewniak
6540f1635a
[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952)
This commit adds the -lower-buffer-fat-pointers pass, which is
applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type `ptr addrspace(7)` from
incoming IR. This must be done at the LLVM IR level because `ptr
addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled
by SelectionDAG.

The detailed operation of the pass is described in comments, but, in
summary, the removal proceeds by:
1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of
i160 (including vectors and aggregates). This is needed because the
in-register representation of these pointers will stop matching their
in-memory representation in step 2, and so ptrtoint/inttoptr operations
are used to preserve the expected memory layout

2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with
the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two
parts of a buffer fat pointer (the 128-bit address space 8 resource and
the 32-bit address space 6 offset) visible in the IR. This also impacts
the argument and return types of functions.

3. *Splitting* the resource and offset parts. All instructions that
produce or consume buffer fat pointers (like GEP or load) are rewritten
to produce or consume the resource and offset parts separately. For
example, GEP updates the offset part of the result and a load uses the
resource and offset parts to populate the relevant
llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are
replaced by their new split counterparts, ensuring no invalidly-typed IR
escapes this pass. (For operations like call, where the struct form is
needed, insertelement operations are inserted).
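
The split representation can be modeled with a toy sketch. Everything here is an assumption made for illustration: the `FatPointer` class, the helper names, the choice to put the resource in the high 128 bits of the i160, and the use of a dictionary of byte strings as a stand-in for buffer memory. It is not how the pass is implemented.

```python
from dataclasses import dataclass

MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


@dataclass(frozen=True)
class FatPointer:
    resource: int  # 128-bit buffer resource (address space 8 part)
    offset: int    # 32-bit offset part


def gep(ptr, byte_delta):
    """A GEP only updates the offset part; the resource is untouched."""
    return FatPointer(ptr.resource, (ptr.offset + byte_delta) & MASK32)


def to_i160(ptr):
    """Pack both parts into one 160-bit integer (assumed layout)."""
    return ((ptr.resource & MASK128) << 32) | (ptr.offset & MASK32)


def from_i160(bits):
    """Inverse of to_i160."""
    return FatPointer((bits >> 32) & MASK128, bits & MASK32)


def load(buffers, ptr, size):
    """Stand-in for a buffer load: resource picks the buffer, offset indexes it."""
    data = buffers[ptr.resource]
    return data[ptr.offset:ptr.offset + size]
```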

Compared to LGC's PatchBufferOp (32cda89776/lgc/patch/PatchBufferOp.cpp), this pass:
- Also handles vectors of ptr addrspace(7)s
- Also handles function boundaries
- Includes the same uniform buffer optimization for loops and
conditionals
- Does *not* handle memcpy() and friends (this is future work)
- Does *not* break up large loads and stores into smaller parts. This
should be handled by extending the legalization
of *.buffer.{load,store} to handle larger types by producing multiple
instructions (the same way ordinary LOAD and STORE are legalized). That
work is planned for a followup commit.
- Does *not* have special logic for handling divergent buffer
descriptors. The logic in LGC is, as far as I can tell, incorrect in
general, and, per discussions with @nhaehnle, isn't widely used.
Therefore, divergent descriptors are handled with waterfall loops later
in legalization.

As a final matter, this commit updates atomic expansion to treat buffer
operations analogously to global ones.

(One question for reviewers: is the new pass in the right place? Should
it be later in the pipeline?)

Differential Revision: https://reviews.llvm.org/D158463
2024-03-06 09:49:58 -06:00
Mirko Brkušanin
1fd1f4c0e1
[AMDGPU] Handle amdgpu.last.use metadata (#83816)
Convert !amdgpu.last.use metadata into MachineMemOperand for last use
and handle it in SIMemoryLegalizer similar to nontemporal and volatile.
2024-03-06 16:33:52 +01:00
Emma Pilkington
4490003a22
[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
Joseph Huber
1fc5e50ceb
[AMDGPU] Implement 'llvm.get.fpenv' and 'llvm.set.fpenv' (#83906)
Summary:
This patch implements the LLVM floating point environment control
intrinsics and also exposes it through clang. We encode the floating
point environment as a 64-bit value that simply concatenates the values
of the mode registers and the current trap status. We only fetch the
bits relevant for floating point instructions. That is, rounding mode,
denormalization mode, ieee, dx10 clamp, debug, enabled traps, f16
overflow, and active exceptions.
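
As a rough sketch of what "concatenating" two registers into a 64-bit value looks like, consider the following. The function names and the choice of which register occupies the low half are assumptions for illustration; the commit message does not state the bit layout, and the masking of irrelevant bits is not modeled.

```python
MASK32 = 0xFFFFFFFF


def pack_fpenv(mode, trap_status):
    """Concatenate two 32-bit register values into one 64-bit value.

    Assumed layout: mode in the low 32 bits, trap status in the high 32 bits.
    """
    return ((trap_status & MASK32) << 32) | (mode & MASK32)


def unpack_fpenv(env):
    """Split the 64-bit value back into (mode, trap_status)."""
    return env & MASK32, (env >> 32) & MASK32
```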
2024-03-06 08:11:54 -06:00
Shilei Tian
e9c1dbb408 Revert "[AMDGPU] Replace isInlinableLiteral16 with specific version (#81345)"
This reverts commit 530f0e64ec11327879c44f2fd55c7c28efdbaa2d because it breaks
downstream.
2024-03-06 08:42:54 -05:00
Pierre van Houtryve
52d5b8e02d
[AMDGPU] Don't form sext/abs/neg fp8 cvt (#83843)
gfx940 does not allow abs/sext/neg on v_cvt_fp8/bf8 & pk variants.

Fixes SWDEV-447468
2024-03-06 10:38:20 +01:00
Sameer Sahasrabuddhe
60822637bf Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.
Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-06 12:19:32 +05:30
Noah Goldstein
17162b61c2 [KnownBits] Make nuw and nsw support in computeForAddSub optimal
Just some improvements that should hopefully strengthen analysis.

Closes #83580
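
One way to read "optimal" here is that the result keeps every bit that is the same across all non-poison outcomes. The brute-force reference below is a sketch of that idea for small bit widths, not the algorithm in the patch. Known bits are modeled as a `(zero_mask, one_mask)` pair, and all names are hypothetical.

```python
from itertools import product


def values_matching(zero, one, width):
    """All width-bit values consistent with the known-zero and known-one masks."""
    return [v for v in range(1 << width) if v & zero == 0 and v & one == one]


def optimal_add_known_bits(lhs, rhs, width, nuw=False):
    """Best possible known bits of lhs + rhs, found by enumeration.

    Returns (zero_mask, one_mask), or None if every input pair overflows
    under nuw (so there is no defined result to describe).
    """
    mask = (1 << width) - 1
    zero, one = mask, mask
    seen = False
    for a, b in product(values_matching(*lhs, width), values_matching(*rhs, width)):
        if nuw and a + b > mask:
            continue  # unsigned overflow is excluded when nuw is set
        s = (a + b) & mask
        zero &= ~s & mask  # a bit stays known-zero only if it is 0 in every sum
        one &= s           # a bit stays known-one only if it is 1 in every sum
        seen = True
    return (zero, one) if seen else None
```

For example, with a 4-bit left operand whose top bit is known to be one and a fully unknown right operand, the top bit of the sum is known only when `nuw` rules out wraparound.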
2024-03-05 12:59:58 -06:00
bcahoon
4cf8b298cf
[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597)
The promote alloca to vector transformation assumes that the
vector index is a constant value. If it is not a constant, then
either an assert occurs or the transformation generates an
incorrect index.
2024-03-05 08:18:17 -06:00
Mitch Phillips
f010b1bef4 Revert "Restore "Implement convergence control in MIR using SelectionDAG (#71785)""
This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.

Reason: Broke the sanitizer buildbots. See the comments at
https://github.com/llvm/llvm-project/pull/71785
for more information.
2024-03-04 17:05:34 +01:00
Mirko Brkušanin
27ce5121ee
[AMDGPU] Fix setting nontemporal in memory legalizer (#83815)
Iterator MI can advance in insertWait(), but we need the original instruction
to set the temporal hint. Just move it before handling volatile.
2024-03-04 15:05:31 +01:00
Shilei Tian
530f0e64ec
[AMDGPU] Replace isInlinableLiteral16 with specific version (#81345) 2024-03-04 08:40:42 -05:00
Mirko Brkušanin
982e9022ca
[AMDGPU] Add GFX12 memory legalizer tests (#83814) 2024-03-04 11:22:04 +01:00
Sameer Sahasrabuddhe
c7fdd8c11e Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
Original commit 79889734b940356ab3381423c93ae06f22e772c9.
Previously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-04 13:28:04 +05:30
Bjorn Pettersson
da591d390e [GlobalISel][TableGen] Take first result for multi-output instructions (#81130)
Previously, tblgen would reject patterns where one of its nested
instructions produced more than one result. These arise when the
instruction definition contains 'outs' as well as 'Defs'. This patch
fixes that by always taking the first result, which is how these
situations are handled in SelectionDAG.

Original patch: https://reviews.llvm.org/D86617
Continued as: https://github.com/llvm/llvm-project/pull/81130
2024-03-02 20:10:02 +01:00
Pierre van Houtryve
756166e342
[AMDGPU] Improve detection of non-null addrspacecast operands (#82311)
Use IR analysis to infer when an addrspacecast operand is nonnull, then
lower it to an intrinsic that the DAG can use to skip the null check.

I did this using an intrinsic as it's non-intrusive. An alternative
would have been to allow something like `!nonnull` on `addrspacecast`
then lower that to a custom opcode (or add an operand to the
addrspacecast MIR/DAG opcodes), but it's a lot of boilerplate for just
one target's use case IMO.

I'm hoping that when we switch to GISel that we can move all this logic
to the MIR level without losing info, but currently the DAG doesn't see
enough so we need to act in CGP.

Fixes: SWDEV-316445
2024-03-01 14:01:10 +01:00
Nick Anderson
ba8e9ace13
[AMDGPU] promote i1 arg type for amdgpu_cs (#82971)
fixes #68087 
Not sure where to put regression tests for this pr? Also, should i1 args
not in reg also be promoted?
2024-03-01 14:25:46 +05:30
Leon Clark
5b07fd4799
[AMDGPU] Fix OpenCL conformance test failures for ctlz. (#83170)
Remove LSH transform and restore previous lowering.

Fixes conformance issue in
[77615](https://github.com/llvm/llvm-project/pull/77615) where OpenCL
integer_ops tests fail for integer_clz.

Co-authored-by: Leon Clark <leoclark@amd.com>
2024-02-29 22:28:13 +00:00
Petar Avramovic
0d572c41f9
AMDGPU/GlobalISel: remove amdgpu-global-isel-risky-select flag (#83426)
AMDGPUInstructionSelector should no longer attempt to select S1 G_PHIs.
Remove MIR test that attempts to inst-select divergent vcc(S1) G_PHI.
Lane mask merging algorithm for GlobalISel is now responsible for
selecting divergent S1 G_PHIs in AMDGPUGlobalISelDivergenceLowering.
Uniform S1 G_PHIs should be lowered to S32 G_PHIs in reg bank select
pass. In summary S1 G_PHIs should not reach AMDGPUInstructionSelector.
2024-02-29 15:38:54 +01:00
Petar Avramovic
6c2eec5cea
AMDGPU/GlobalISel: lane masks merging (#73337)
Basic implementation of lane mask merging for GlobalISel.
Lane masks on GlobalISel are registers with sgpr register class
and S1 LLT - required by machine uniformity analysis.
Implements equivalent of lowerPhis from SILowerI1Copies.cpp in:
patch 1: https://github.com/llvm/llvm-project/pull/75340
patch 2: https://github.com/llvm/llvm-project/pull/75349
patch 3: https://github.com/llvm/llvm-project/pull/80003
patch 4: https://github.com/llvm/llvm-project/pull/78431
patch 5: is in this commit:

AMDGPU/GlobalISelDivergenceLowering: constrain incoming registers

Previously, in PHIs that represent lane masks, incoming registers
taken as-is were not selected as lane masks. Such registers are not
being merged with another lane mask and most often only have S1 LLT.
Implement constrainAsLaneMask by constraining incoming registers
taken as-is with lane mask attributes, essentially transforming them
to lane masks. This is the final step in having PHI instructions created
in this pass be fully instruction-selected.
2024-02-29 13:57:59 +01:00
Matt Arsenault
6cfd3439d4
APFloat: Fix signed zero handling in minnum/maxnum (#83376)
Follow the 2019 rules and order -0 as less than +0 and +0 as greater
than -0. As currently defined this isn't required for the intrinsics,
but is a better QoI.

This will avoid the workaround in libc added by #83158
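
The ordering described above can be sketched in Python as follows. This is a model of the intended semantics only (return the non-NaN operand if one input is NaN, and treat -0 as smaller than +0), not the APFloat implementation.

```python
import math


def minnum(a, b):
    """Minimum that ignores a single NaN and orders -0.0 below +0.0."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == 0.0 and b == 0.0:
        # Both are zeros: prefer the one with the sign bit set.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b


def maxnum(a, b):
    """Maximum that ignores a single NaN and orders +0.0 above -0.0."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == 0.0 and b == 0.0:
        # Both are zeros: prefer the one with the sign bit clear.
        return a if math.copysign(1.0, a) > 0 else b
    return a if a > b else b
```

Because `-0.0 == 0.0` compares equal, the sign has to be inspected with `math.copysign` rather than with `<`.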
2024-02-29 16:51:33 +05:30
Shilei Tian
191fd2d9db
[NFC][AMDGPU] Move the rem tests in div_i128.ll into rem_i128.ll (#83307) 2024-02-28 18:47:02 -05:00
Petar Avramovic
3e35ba53e2
AMDGPU/GFX12: Insert waitcnts before stores with scope_sys (#82996)
Insert waitcnts for loads and atomics before stores with system scope.
Scope is a field in the instruction encoding and corresponds to the desired
coherence level in the cache hierarchy.
Intrinsic stores can set scope in the cache policy operand.
If the volatile keyword is used on generic stores, the memory legalizer will set
scope to system. Generic stores, by default, get the lowest scope level.
Waitcnts are not required if it is guaranteed that memory is cached.
For example vulkan shaders can guarantee this.
TODO: implement flag for frontends to give us a hint not to insert
waits.
Expecting vulkan flag to be implemented as vulkan:private MMRA.
2024-02-28 16:18:04 +01:00
Valery Pykhtin
a845ea3878
[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. (#82406)
This fixes a crash when operand sources for V_OR instruction reside in
different basic blocks.
2024-02-28 14:47:33 +01:00
Jeffrey Byrnes
cf1c97b2d2
[AMDGPU] Do not attempt to fallback to default mutations (#83208)
IGLP itself will be in SavedMutations via mutations added during
Scheduler creation, thus falling back results in reapplying IGLP.

In PostRA scheduling, if we have multiple regions with IGLP
instructions, then we may have an infinite loop.

Disable the feature for now.
2024-02-27 18:04:59 -08:00
choikwa
04db60d150
[AMDGPU] Prevent hang in SIFoldOperands by caching uses (#82099)
foldOperands() for REG_SEQUENCE has recursion that can trigger an infinite loop
as the method can modify the operand order, which messes up the range-based
for loop. This patch fixes the issue by caching the uses for processing beforehand,
and then iterating over the cache rather than using the instruction iterator.
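
The general technique, taking a snapshot of a collection before running code that may reorder it, can be sketched like this. The function names are hypothetical and a Python list stands in for the use list; this is not the SIFoldOperands code.

```python
def fold_all(uses, fold):
    """Call fold(uses, use) once for every use present at entry.

    The snapshot is taken up front, so fold() is free to reorder `uses`
    without affecting which elements are visited.
    """
    snapshot = list(uses)
    visited = []
    for use in snapshot:
        fold(uses, use)
        visited.append(use)
    return visited


def move_to_back(uses, use):
    """Example mutation: reorder the live list while it is being processed."""
    uses.remove(use)
    uses.append(use)
```

Iterating over the live list while `move_to_back` runs would skip or revisit elements; iterating over the snapshot visits each original element exactly once.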
2024-02-27 09:13:59 -06:00
Matt Arsenault
ca66f7469f AMDGPU: Merge tests for llvm.amdgcn.dispatch.id 2024-02-27 18:42:40 +05:30
Matt Arsenault
2e4643a53e AMDGPU: Regenerate baseline test checks 2024-02-27 18:42:40 +05:30
michaelselehov
56ad6d1939
[MachineLICM] Hoist COPY instruction only when user can be hoisted (#81735)
befa925acac8fd6a9266e introduced preliminary hoisting of COPY
instructions when the user of the COPY is inside the same loop. That
optimization appeared to be too aggressive and hoisted too many COPYs,
greatly increasing register pressure and causing performance regressions
for the AMDGPU target.

This is intended to fix the regression by hoisting COPY instruction only
if either:
 - User of COPY can be hoisted (other args are invariant) 
 or
 - Hoisting COPY doesn't bring high register pressure
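
The two conditions amount to a simple disjunction, sketched below. The parameter names and the idea of comparing against a single numeric pressure limit are assumptions for illustration; the actual pass uses its own register pressure bookkeeping.

```python
def should_hoist_copy(user_is_hoistable, pressure_after_hoist, pressure_limit):
    """Hoist a COPY if its user can be hoisted too, or if pressure stays acceptable."""
    return user_is_hoistable or pressure_after_hoist <= pressure_limit
```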
2024-02-27 12:31:29 +00:00
Matt Arsenault
e7900e695e AMDGPU: Regenerate baseline mir tests 2024-02-27 10:44:53 +05:30
Noah Goldstein
15a7de697a [SelectionDAG] Support sign tracking through {S|U}INT_TO_FP
Just a minimal amount of easily provable tracking.

Proofs: https://alive2.llvm.org/ce/z/RQYbdw

Closes #82808

Alive2 has an issue with `(sitofp i1)`, but it can
be verified by hand: https://godbolt.org/z/qKr7hT7s9
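
The sign facts being relied on can also be checked exhaustively for small integer widths. The sketch below is a model, not SelectionDAG code: Python's `float()` (a double) stands in for the target floating-point type, which is wide enough here that no conversion overflows, and the function names are made up.

```python
import math


def sign_bit(f):
    """True if the floating-point sign bit is set (distinguishes -0.0 too)."""
    return math.copysign(1.0, f) < 0.0


def sitofp(bits, width):
    """Convert a width-bit pattern, read as a signed integer, to float."""
    signed = bits - (1 << width) if bits >> (width - 1) else bits
    return float(signed)


def uitofp(bits, width):
    """Convert a width-bit pattern, read as an unsigned integer, to float."""
    return float(bits)


def sign_facts_hold(width):
    """Check both sign facts for every width-bit input."""
    for bits in range(1 << width):
        msb_set = bool(bits >> (width - 1))
        if sign_bit(uitofp(bits, width)):
            return False  # uitofp must never produce a negative value
        if sign_bit(sitofp(bits, width)) != msb_set:
            return False  # sitofp sign must match the integer sign bit
    return True
```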
2024-02-26 15:35:38 -06:00
Jeffrey Byrnes
113052b2b0 [AMDGPU] Prefer lower total register usage in regions with spilling
Change-Id: Ia5c434b0945bdcbc357c5e06c3164118fc91df25
2024-02-26 12:19:52 -08:00
Petar Avramovic
433f8e741e
MachineSSAUpdater: use all vreg attributes instead of reg class only (#78431)
When initializing MachineSSAUpdater, save all attributes of the current
virtual register and create new virtual registers with the same attributes.
New virtual registers now have the same register class or bank and the same LLT.
Previously, new virtual registers had the same register class but the LLT was
not set (it was left as the default/empty LLT).
Required by GlobalISel for AMDGPU: new 'lane mask' virtual registers
created by MachineSSAUpdater need to have both a register class and an LLT.

patch 4 from: https://github.com/llvm/llvm-project/pull/73337
2024-02-26 13:46:13 +01:00
Jack Styles
28233408a2
[CodeGen] [ARM] Make RISC-V Init Undef Pass Target Independent and add support for the ARM Architecture. (#77770)
When using Greedy Register Allocation, there are times where
early-clobber values are ignored, and assigned the same register. This
is illegal behaviour for these instructions. To get around this, using
Pseudo instructions for early-clobber registers gives them a definition
and allows Greedy to assign them to a different register. This then
meets the ARM Architecture Reference Manual and matches the defined
behaviour.

This patch takes the existing RISC-V patch and makes it target
independent, then adds support for the ARM Architecture. Doing this will
ensure early-clobber constraints are followed when using the ARM
Architecture. Making the pass target independent will also open up the
possibility that support for other architectures can be added in the future.
2024-02-26 12:12:31 +00:00
Rishabh Bali
fe42e72db2
[CodeGen] Port AtomicExpand to new Pass Manager (#71220)
Port the `atomicexpand` pass to the new Pass Manager. 
Fixes #64559
2024-02-25 18:42:22 +05:30
Jeffrey Byrnes
8f2bd8ae68
[AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (#81342)
This implements the basic pipelining structure of exp/mfma interleaving
for better extensibility. While it does have improved extensibility,
there are controls which only enable it for DAGs with certain
characteristics (matching the DAGs it has been designed against).
2024-02-23 17:13:20 -08:00
Pierre van Houtryve
4235e44d4c
[GlobalISel] Constant-fold G_PTR_ADD with different type sizes (#81473)
All other opcodes in the list are constrained to have the same type on
both operands, but not G_PTR_ADD.

Fixes  #81464
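
A sketch of folding a pointer add whose offset has a different width than the base follows. It assumes the offset is sign-extended or truncated to the pointer width before the addition; that choice, and the helper names, are assumptions for illustration rather than something stated in the commit message.

```python
def sext_or_trunc(value, from_bits, to_bits):
    """Sign-extend or truncate a from_bits-wide pattern to to_bits."""
    value &= (1 << from_bits) - 1
    if to_bits > from_bits and value >> (from_bits - 1):
        value -= 1 << from_bits  # reinterpret as a negative number
    return value & ((1 << to_bits) - 1)


def fold_ptr_add(base, base_bits, offset, offset_bits):
    """Constant-fold base + offset, with the result as wide as the base."""
    adjusted = sext_or_trunc(offset, offset_bits, base_bits)
    return (base + adjusted) & ((1 << base_bits) - 1)
```

With a 64-bit base of 0x1000 and a 32-bit offset of 0xFFFFFFFC (that is, -4), the folded result under this assumption is 0xFFC.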
2024-02-22 13:15:26 +01:00
Nick Anderson
8bd327d6fe
[AMDGPU][GlobalISel] Add fdiv / sqrt to rsq combine (#78673)
Fixes #64743
2024-02-22 09:47:36 +01:00