7260 Commits

Author SHA1 Message Date
Thorsten Schütt
deefe3fbc9
[GlobalIsel] Post-review combine ADDO (#85961)
https://github.com/llvm/llvm-project/pull/82927
2024-03-21 03:56:40 +01:00
Jonas Paulsson
9ebd329ad8 Revert "Move assertion for AdjustsStack from PEI to MachineVerifier. (#85698)"
This reverts commit 05bde30585710a51592eee0a6cf6df8184d09c92.

Reverting due to verifier complaints with expensive checks on build-bot.
2024-03-20 11:48:30 -04:00
Jonas Paulsson
05bde30585
Move assertion for AdjustsStack from PEI to MachineVerifier. (#85698)
Have the verifier report a missing AdjustsStack flag rather than waiting until
PEI asserts.
2024-03-20 10:29:12 -04:00
Pravin Jagtap
e52a687871
[AMDGPU][NFC] Test clean up (#85922)
Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-20 17:29:42 +05:30
Pravin Jagtap
070d1e8321
[AMDGPU] Add test for fpext & fptrunc with bf16. (#85909)
Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-20 14:45:38 +05:30
Peter Rong
4a026b5092
[AMDGCN] Use ZExt when handling indices in insertment element (#85718)
When i1 true is used as an index, SExt extends it to i32 -1. This would
cause BitVector to overflow.
The language manual have specified that the index shall be treated as an
unsigned number, this patch fixes that.
(https://llvm.org/docs/LangRef.html#insertelement-instruction)

This patch fixes #85717

---------

Signed-off-by: Peter Rong <PeterRong96@gmail.com>
2024-03-19 21:44:08 -07:00
Changpeng Fang
ab76052fa9
AMDGPU: Treat SWMMAC the same as MFMA and other WMMA for sched_barrier (#85721) 2024-03-19 09:58:09 -07:00
Pravin Jagtap
08701e35ed
[AMDGPU][NFC] Test clean up. (#85775)
Added common check for DPP and Iterative strategies for uniform value
case since optimization applied is same.

Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-19 18:00:34 +05:30
Pierre van Houtryve
953c13b5c9
[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735)
Update PromoteAllocaToVector so it considers the whole function before promoting allocas.
Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca.

Passed internal performance testing.
2024-03-19 11:49:22 +01:00
Jonas Paulsson
09bc6abba6
[MachineFrameInfo] Refactoring around computeMaxcallFrameSize() (NFC) (#78001)
- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.

- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
2024-03-18 10:37:59 -04:00
Yingwei Zheng
38a44bdc93
[CodeGenPrepare] Reverse the canonicalization of isInf/isNanOrInf (#81572)
In commit
2b582440c1,
we canonicalize the isInf/isNanOrInf idiom into fabs+fcmp for better
analysis/codegen (See also the discussion in
https://github.com/llvm/llvm-project/pull/76338).

This patch reverses the fabs+fcmp to `is.fpclass`. If the `is.fpclass`
is not supported by the target, it will be expanded by TLI.

Fixes the regression introduced by
2b582440c1
and
https://github.com/llvm/llvm-project/pull/80414#issuecomment-1936374206.
2024-03-18 18:27:45 +08:00
pvanhout
3493438605 Revert "[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333)"
This reverts commit 9b98692eedb78aa106539c36ba02944f32cae1ff.
2024-03-18 11:18:57 +01:00
Pierre van Houtryve
9b98692eed
[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333)
This change allows us to use `--lto-partitions` in some cases (not at
all guaranteed it works perfectly), as LDS is lowered before the module
is split for parallel codegen.

We must run LowerLDS before splitting modules as it needs to see all
callers of functions with LDS to properly lower them.
2024-03-18 09:09:43 +01:00
Sameer Sahasrabuddhe
ec34699f75
[GlobalISel] convergence control tokens and intrinsics (#67006)
[GlobalISel] Implement convergence control tokens and intrinsics in GMIR

In the IR translator, convert the LLVM token type to LLT::token(), which is an
alias for the s0 type. These show up as implicit uses on convergent operations.

Differential Revision: https://reviews.llvm.org/D158147
2024-03-18 10:34:11 +05:30
Jay Foad
092999e70b [AMDGPU] Update checks in new test after #85370 2024-03-15 14:10:30 +00:00
Matt Arsenault
9b5d9a81bd AMDGPU: Regenerate test checks from c7c561ef9
The test output changed after initial commit/test in
5f774619eac5db73398225a4c924a9c1d437fb40
2024-03-15 16:27:42 +05:30
Matt Arsenault
c7c561ef98 AMDGPU: Enable ExpandLargeFpConvert for > 64-bit types
Fixes casts between double/float/half and i128. The pass seems to be
broken for bfloat though. I also believe we could have a better implementation
which attempts to make use the native 32-bit conversion instructions like
the 64-bit expansion does.
2024-03-15 16:08:39 +05:30
Thorsten Schütt
5f774619ea
[GlobalIsel] Combine ADDO (#82927)
Perform the requested arithmetic and produce a carry output in addition
to the normal result.

Clang has them as builtins (__builtin_add_overflow_p). The middle end
has intrinsics for them (sadd_with_overflow).

AArch64: ADDS Add and set flags

On Neoverse V2, they run at half the throughput of basic arithmetic and
have a limited set of pipelines.
2024-03-14 12:45:19 +01:00
Carl Ritson
c29b265eb9 Reapply "[AMDGPU] Add pal metadata 3.0 support to callable pal funcs (#67104)"
This reverts commit 7d508eb5d38f4bbbab4230a666d9e742e271af61.
2024-03-14 10:56:43 +09:00
Harald van Dijk
ceb744eb2f
[AMDGPU] Fix canonicalization of truncated values. (#83054)
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes #82937.
2024-03-13 12:08:39 +00:00
Jun Wang
c4e517f59c
[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-03-12 10:30:39 -07:00
Matt Arsenault
bd72ebd8d1
AMDGPU: Add some more mfma hazard recognizer tests (#84727) 2024-03-12 22:05:47 +05:30
Jake Egan
fa1d13590c [AIX][tests] Disable failing tests on AIX
These new tests are failing on the AIX bot because the -I option isn't supported.

Disable these tests for now until they can be fixed.
2024-03-12 12:11:18 -04:00
Pierre van Houtryve
d4569d42b5
[AMDGPU] Let LowerModuleLDS run twice on the same module (#81729)
If all variables in the module are absolute, this means we're running
the pass again on an already lowered module, and that works.
If none of them are absolute, lowering can proceed as usual.
Only diagnose cases where we have a mix of absolute/non-absolute GVs,
which means we added LDS GVs after lowering, which is broken.

See #81491
Split from #75333
2024-03-11 09:20:01 +01:00
AtariDreams
4e0e9b17c6
[SelectionDAG] Switch to LiveRegUnits (#84197) 2024-03-11 12:47:39 +05:30
Carl Ritson
4a21e3afa2
[LiveIntervals] repairIntervalsInRange: recompute width changes (#78564)
Extend repairIntervalsInRange to completely recompute the interva for a
register if subregister defs exist without precise subrange matches
(LaneMask exactly matching subregister).
This occurs when register sequences are lowered to copies such that the
size of the copies do not match any uses of the subregisters formed
(i.e. during twoaddressinstruction).

The subranges without this change are probably legal, but do not match
those generated by live interval computation. This creates problems with
other code that assumes subranges precisely cover all subregisters
defined, e.g. shrinkToUses().
2024-03-11 15:24:17 +09:00
Carl Ritson
d9e6aa7048
[AMDGPU] Update LiveInterval def index for early-clobber (#79285)
On converting an instruction to an early-clobber definition in
convertToThreeAddress, we must also update live intervals for the
register to start at the early-clobber index.
2024-03-11 14:54:11 +09:00
Jay Foad
fd3eaf76ba
[GISel] Enforce G_PTR_ADD RHS type matching index size for addr space (#84352) 2024-03-09 09:07:22 +00:00
Shilei Tian
e963d0740e
[AMDGPU] Replace isInlinableLiteral16 with specific version (#84402)
The current implementation of `isInlinableLiteral16` assumes, a 16-bit
inlinable
literal is either an `i16` or a `fp16`. This is not always true because
of
`bf16`. However, we can't tell `fp16` and `bf16` apart by just looking
at the
value. This patch splits `isInlinableLiteral16` into three versions,
`i16`,
`fp16`, `bf16` respectively, and call the corresponding version.
2024-03-08 14:49:52 -05:00
Pierre van Houtryve
4b1910b11d
[GlobalISel][AMDGPU] Import patterns with multiple defs (#84171)
Fixes #63216
2024-03-08 09:39:10 +01:00
Fangrui Song
66bd3cd75b [AMDGPU,test] Change llc -march= to -mtriple=
PR #75982 had been created before these tests were added, therefore
some test were not updated.
2024-03-07 19:09:18 -08:00
David Green
44be5a7fdc
[Codegen] Make Width in getMemOperandsWithOffsetWidth a LocationSize. (#83875)
This is another part of #70452 which makes getMemOperandsWithOffsetWidth
use a LocationSize for Width, as opposed to the unsigned it currently
uses. The advantages on it's own are not super high if
getMemOperandsWithOffsetWidth usually uses known sizes, but if the
values can come from an MMO it can help be more accurate in case they
are Unknown (and in the future, scalable).
2024-03-06 17:40:13 +00:00
Krzysztof Drewniak
6540f1635a
[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952)
This commit adds the -lower-buffer-fat-pointers pass, which is
applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type `ptr addrspace(7)` from
incoming IR. This must be done at the LLVM IR level because `ptr
addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled
by SelectionDAG.

The detailed operation of the pass is described in comments, but, in
summary, the removal proceeds by:
1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of
i160 (including vectors and aggregates). This is needed because the
in-register representation of these pointers will stop matching their
in-memory representation in step 2, and so ptrtoint/inttoptr operations
are used to preserve the expected memory layout

2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with
the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two
parts of a buffer fat pointer (the 128-bit address space 8 resource and
the 32-bit address space 6 offset) visible in the IR. This also impacts
the argument and return types of functions.

3. *Splitting* the resource and offset parts. All instructions that
produce or consume buffer fat pointers (like GEP or load) are rewritten
to produce or consume the resource and offset parts separately. For
example, GEP updates the offset part of the result and a load uses the
resource and offset parts to populate the relevant
llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are
replaced by their new split counterparts, ensuring no invalidly-typed IR
escapes this pass. (For operations like call, where the struct form is
needed, insertelement operations are inserted).

Compared to LGC's PatchBufferOp (

32cda89776/lgc/patch/PatchBufferOp.cpp
): this pass
- Also handles vectors of ptr addrspace(7)s
- Also handles function boundaries
- Includes the same uniform buffer optimization for loops and
conditionals
- Does *not* handle memcpy() and friends (this is future work)
- Does *not* break up large loads and stores into smaller parts. This
should be handled by extending the legalization
of *.buffer.{load,store} to handle larger types by producing multiple
instructions (the same way ordinary LOAD and STORE are legalized). That
work is planned for a followup commit.
- Does *not* have special logic for handling divergent buffer
descriptors. The logic in LGC is, as far as I can tell, incorrect in
general, and, per discussions with @nhaehnle, isn't widely used.
Therefore, divergent descriptors are handled with waterfall loops later
in legalization.

As a final matter, this commit updates atomic expansion to treat buffer
operations analogously to global ones.

(One question for reviewers: is the new pass is the right place? Should
it be later in the pipeline?)

Differential Revision: https://reviews.llvm.org/D158463
2024-03-06 09:49:58 -06:00
Mirko Brkušanin
1fd1f4c0e1
[AMDGPU] Handle amdgpu.last.use metadata (#83816)
Convert !amdgpu.last.use metadata into MachineMemOperand for last use
and handle it in SIMemoryLegalizer similar to nontemporal and volatile.
2024-03-06 16:33:52 +01:00
Emma Pilkington
4490003a22
[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
Joseph Huber
1fc5e50ceb
[AMDGPU] Implement 'llvm.get.fpenv' and 'llvm.set.fpenv' (#83906)
Summary:
This patch implements the LLVM floating point environment control
intrinsics and also exposes it through clang. We encode the floating
point environment as a 64-bit value that simply concatenates the values
of the mode registers and the current trap status. We only fetch the
bits relevant for floating point instructions. That is, rounding mode,
denormalization mode, ieee, dx10 clamp, debug, enabled traps, f16
overflow, and active exceptions.
2024-03-06 08:11:54 -06:00
Shilei Tian
e9c1dbb408 Revert "[AMDGPU] Replace isInlinableLiteral16 with specific version (#81345)"
This reverts commit 530f0e64ec11327879c44f2fd55c7c28efdbaa2d because it breaks
downstream.
2024-03-06 08:42:54 -05:00
Pierre van Houtryve
52d5b8e02d
[AMDGPU] Don't form sext/abs/neg fp8 cvt (#83843)
gfx940 does not allow abs/sext/neg on v_cvt_fp8/bf8 & pk variants.

Fixes SWDEV-447468
2024-03-06 10:38:20 +01:00
Sameer Sahasrabuddhe
60822637bf Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.
Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-06 12:19:32 +05:30
Noah Goldstein
17162b61c2 [KnownBits] Make nuw and nsw support in computeForAddSub optimal
Just some improvements that should hopefully strengthen analysis.

Closes #83580
2024-03-05 12:59:58 -06:00
bcahoon
4cf8b298cf
[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597)
The promote alloca to vector transformation assumes that the
vector index is a constant value. If it is not a constant, then
either an assert occurs or the tranformation generates an
incorrect index.
2024-03-05 08:18:17 -06:00
Mitch Phillips
f010b1bef4 Revert "Restore "Implement convergence control in MIR using SelectionDAG (#71785)""
This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.

Reason: Broke the sanitizer buildbots. See the comments at
https://github.com/llvm/llvm-project/pull/71785
for more information.
2024-03-04 17:05:34 +01:00
Mirko Brkušanin
27ce5121ee
[AMDGPU] Fix setting nontemporal in memory legalizer (#83815)
Iterator MI can advance in insertWait() but we need original instruction
to set temporal hint. Just move it before handling volatile.
2024-03-04 15:05:31 +01:00
Shilei Tian
530f0e64ec
[AMDGPU] Replace isInlinableLiteral16 with specific version (#81345) 2024-03-04 08:40:42 -05:00
Mirko Brkušanin
982e9022ca
[AMDGPU] Add GFX12 memory legalizer tests (#83814) 2024-03-04 11:22:04 +01:00
Sameer Sahasrabuddhe
c7fdd8c11e Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
Original commit 79889734b940356ab3381423c93ae06f22e772c9.
Perviously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-04 13:28:04 +05:30
Bjorn Pettersson
da591d390e [GlobalISel][TableGen] Take first result for multi-output instructions (#81130)
Previously, tblgen would reject patterns where one of its nested
instructions produced more than one result. These arise when the
instruction definition contains 'outs' as well as 'Defs'. This patch
fixes that by always taking the first result, which is how these
situations are handled in SelectionIDAG.

Original patch: https://reviews.llvm.org/D86617
Continued as: https://github.com/llvm/llvm-project/pull/81130
2024-03-02 20:10:02 +01:00
Pierre van Houtryve
756166e342
[AMDGPU] Improve detection of non-null addrspacecast operands (#82311)
Use IR analysis to infer when an addrspacecast operand is nonnull, then
lower it to an intrinsic that the DAG can use to skip the null check.

I did this using an intrinsic as it's non-intrusive. An alternative
would have been to allow something like `!nonnull` on `addrspacecast`
then lower that to a custom opcode (or add an operand to the
addrspacecast MIR/DAG opcodes), but it's a lot of boilerplate for just
one target's use case IMO.

I'm hoping that when we switch to GISel that we can move all this logic
to the MIR level without losing info, but currently the DAG doesn't see
enough so we need to act in CGP.

Fixes: SWDEV-316445
2024-03-01 14:01:10 +01:00
Nick Anderson
ba8e9ace13
[AMDGPU] promote i1 arg type for amdgpu_cs (#82971)
fixes #68087 
Not sure where to put regression tests for this pr? Also, should i1 args
not in reg also be promoted?
2024-03-01 14:25:46 +05:30
Leon Clark
5b07fd4799
[AMDGPU] Fix OpenCL conformance test failures for ctlz. (#83170)
Remove LSH transform and restore previous lowering.

Fixes conformance issue in
[77615](https://github.com/llvm/llvm-project/pull/77615) where OpenCL
integer_ops tests fail for integer_clz.

Co-authored-by: Leon Clark <leoclark@amd.com>
2024-02-29 22:28:13 +00:00