2067 Commits

Author SHA1 Message Date
David Green
ac321cbb03
[AArch64][GlobalISel] Legalize Insert vector element (#81453)
This attempts to standardize and extend some of the insert vector
element lowering. Most notably:
- More types are handled by splitting illegal vectors.
- The index type for G_INSERT_VECTOR_ELT is canonicalized to
  TLI.getVectorIdxTy(), similar to extract_vector_element.
- Some of the existing patterns now have the index type specified to
  make sure they can apply to GISel too.
- The C++ selection code has been removed, relying on tablegen patterns.
- G_INSERT_VECTOR_ELT with small GPR input elements are pre-selected to
  use an i32 type, allowing the existing patterns to apply.
- Variable index inserts are lowered in post-legalizer lowering,
  expanding into a stack store and reload.
2024-04-08 08:44:13 +01:00
Jay Foad
3cf539fb04
[AMDGPU] Combine or remove redundant waitcnts at the end of each MBB (#87539)
Call generateWaitcnt unconditionally at the end of
SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to
generate a new waitcnt instruction it has the effect of combining or
removing redundant waitcnts that were already present. Tests show
various small improvements in waitcnt placement.
2024-04-04 10:14:16 +01:00
Ruiling, Song
216b5e9666
[AMDGPU] Expose RTZ version of f16 interpolation for gfx11+ (#86614) 2024-04-01 09:48:37 +08:00
Shilei Tian
3a106e5b2c
[GlobalISel] Fold G_ICMP if possible (#86357)
This patch tries to fold `G_ICMP` if possible.
2024-03-29 15:59:50 -04:00
Shilei Tian
661bb9daae
[GlobalISel] Handle div-by-pow2 (#83155)
This patch adds similar handling of div-by-pow2 as in `SelectionDAG`.
2024-03-29 12:41:47 -04:00
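As an illustration of the div-by-pow2 handling mentioned in 661bb9daae, the following is a minimal Python sketch of the usual shift-based expansion of a signed division by a power of two. It is not taken from the patch; the function names `sdiv_pow2` and `trunc_div` and the default width are hypothetical and only emulate fixed-width arithmetic on Python integers.

```python
def sdiv_pow2(x: int, k: int, bits: int = 32) -> int:
    """Signed division of x by 2**k, rounding toward zero, using shifts and adds.

    Hypothetical helper for illustration; x must fit in a signed `bits`-wide integer.
    """
    assert 0 <= k < bits - 1
    assert -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    sign = x >> (bits - 1)         # 0 for non-negative x, -1 for negative x
    bias = sign & ((1 << k) - 1)   # 2**k - 1 for negative x, otherwise 0
    return (x + bias) >> k         # arithmetic shift right after biasing


def trunc_div(x: int, d: int) -> int:
    """Reference division that rounds toward zero, like a hardware sdiv."""
    q = abs(x) // abs(d)
    return q if (x >= 0) == (d > 0) else -q
```

The bias is needed because an arithmetic shift alone rounds toward negative infinity, whereas signed division rounds toward zero.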
Thomas Symalla
256343a0e9
Revert "Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226" (#86273)
Reverts llvm/llvm-project#81394

This reverts commit 3ac243bc0d7922d083af2cf025247b5698556062.
It is not handling RSrc registers s0-s3 correctly. This leads to a
broken test, where it expects s0-s3 as function argument and uses it as
RSrc register as well.
We need to revisit the patch, but apparently we only want to have s0-s3 as
argument registers if we don't need them as RSrc registers.
2024-03-26 11:01:08 +01:00
David Green
4d315ff382
[GlobalISel] Add CTLZ known bits. (#86436)
Replicated from SDAG.
2024-03-26 09:11:35 +00:00
David Stuttard
75e528fdd9
[AMDGPU] Extend zero initialization of return values for TFE (#85759)
buffer_load instructions that use TFE also need to zero initialize
return values similar to how the image instructions currently work. Add
support for this with standard zero init of all results + zero init of
just TFE flag when enable-prt-strict-null subtarget feature is disabled.
2024-03-25 09:01:46 +00:00
Evgenii Kudriashov
d365a45cb3
[GlobalISel] Introduce G_TRAP, G_DEBUGTRAP, G_UBSANTRAP (#84941)
Here we introduce three new GMIR instructions to cover a set of trap
intrinsics. The idea behind it is that generic intrinsics shouldn't be
used with G_INTRINSIC opcode.

These new instructions can match perfectly with existing trap ISD nodes.
It allows X86, AArch64, RISCV and Mips to reuse SelectionDAG patterns for
selection and avoid manual selection. However, AMDGPU is an exception. It
selects traps during legalization regardless of whether SelectionDAG or
GlobalISel is used.

Since there are not many places where traps are used, this change
attempts to clean up all the usages of G_INTRINSIC with trap intrinsics. So,
there is no stage when both G_TRAP and
G_INTRINSIC_W_SIDE_EFFECTS(@llvm.trap) are allowed.
2024-03-23 13:12:44 +01:00
Pravin Jagtap
e1a8120a63
[AMDGPU] Support double type in atomic optimizer. (#84307)
Presently the atomic optimizer supports only 32-bit operations. The plan is
to extend the atomic optimizer to 64-bit operations for compute and
graphics. This patch adds support for the double type for `uniform
values` only. Support for divergent values will follow; it requires
extending/legalizing readfirstlane, readlane, writelane, etc. ops for
64-bit operations to avoid the `bitcast` noise that we have currently.

---------

Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-22 09:25:06 +05:30
SahilPatidar
3ac243bc0d
Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226 (#81394)
Resolve #78226
2024-03-21 16:52:08 +05:30
Thorsten Schütt
5f774619ea
[GlobalISel] Combine ADDO (#82927)
Perform the requested arithmetic and produce a carry output in addition
to the normal result.

Clang has them as builtins (__builtin_add_overflow_p). The middle end
has intrinsics for them (sadd_with_overflow).

AArch64: ADDS Add and set flags

On Neoverse V2, they run at half the throughput of basic arithmetic and
have a limited set of pipelines.
2024-03-14 12:45:19 +01:00
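To make the "arithmetic plus carry output" semantics described in 5f774619ea concrete, here is a small Python sketch of unsigned and signed add-with-overflow at a fixed width. The names `uaddo` and `saddo` are chosen here to echo the generic opcodes; the code is an illustration of the semantics, not the combine implemented by the patch.

```python
def uaddo(a: int, b: int, bits: int = 32) -> tuple[int, bool]:
    """Unsigned add: returns (wrapped result, carry-out flag)."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    return total & mask, total > mask


def saddo(a: int, b: int, bits: int = 32) -> tuple[int, bool]:
    """Signed add: returns (wrapped two's complement result, overflow flag)."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    total = a + b
    # Wrap the exact sum back into the signed range [lo, hi].
    wrapped = ((total - lo) & ((1 << bits) - 1)) + lo
    return wrapped, not (lo <= total <= hi)
```

A combine can fold such an operation when the overflow flag is provably constant, for example when one operand is zero.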
Jay Foad
fd3eaf76ba
[GISel] Enforce G_PTR_ADD RHS type matching index size for addr space (#84352) 2024-03-09 09:07:22 +00:00
Pierre van Houtryve
4b1910b11d
[GlobalISel][AMDGPU] Import patterns with multiple defs (#84171)
Fixes #63216
2024-03-08 09:39:10 +01:00
Fangrui Song
66bd3cd75b [AMDGPU,test] Change llc -march= to -mtriple=
PR #75982 had been created before these tests were added, therefore
some tests were not updated.
2024-03-07 19:09:18 -08:00
Krzysztof Drewniak
6540f1635a
[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952)
This commit adds the -lower-buffer-fat-pointers pass, which is
applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type `ptr addrspace(7)` from
incoming IR. This must be done at the LLVM IR level because `ptr
addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled
by SelectionDAG.

The detailed operation of the pass is described in comments, but, in
summary, the removal proceeds by:
1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of
i160 (including vectors and aggregates). This is needed because the
in-register representation of these pointers will stop matching their
in-memory representation in step 2, and so ptrtoint/inttoptr operations
are used to preserve the expected memory layout

2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with
the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two
parts of a buffer fat pointer (the 128-bit address space 8 resource and
the 32-bit address space 6 offset) visible in the IR. This also impacts
the argument and return types of functions.

3. *Splitting* the resource and offset parts. All instructions that
produce or consume buffer fat pointers (like GEP or load) are rewritten
to produce or consume the resource and offset parts separately. For
example, GEP updates the offset part of the result and a load uses the
resource and offset parts to populate the relevant
llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are
replaced by their new split counterparts, ensuring no invalidly-typed IR
escapes this pass. (For operations like call, where the struct form is
needed, insertelement operations are inserted).

Compared to LGC's PatchBufferOp (32cda89776/lgc/patch/PatchBufferOp.cpp),
this pass:
- Also handles vectors of ptr addrspace(7)s
- Also handles function boundaries
- Includes the same uniform buffer optimization for loops and
conditionals
- Does *not* handle memcpy() and friends (this is future work)
- Does *not* break up large loads and stores into smaller parts. This
should be handled by extending the legalization
of *.buffer.{load,store} to handle larger types by producing multiple
instructions (the same way ordinary LOAD and STORE are legalized). That
work is planned for a followup commit.
- Does *not* have special logic for handling divergent buffer
descriptors. The logic in LGC is, as far as I can tell, incorrect in
general, and, per discussions with @nhaehnle, isn't widely used.
Therefore, divergent descriptors are handled with waterfall loops later
in legalization.

As a final matter, this commit updates atomic expansion to treat buffer
operations analogously to global ones.

(One question for reviewers: is the new pass in the right place? Should
it be later in the pipeline?)

Differential Revision: https://reviews.llvm.org/D158463
2024-03-06 09:49:58 -06:00
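The splitting described in 6540f1635a can be pictured with a small data-structure sketch. The Python below models a buffer fat pointer as a 128-bit resource plus a 32-bit offset, with a GEP-like operation that only touches the offset. The class name `SplitFatPointer`, its methods, and in particular the bit order used for the i160 form (resource in the high bits, offset in the low bits) are assumptions made for illustration and are not taken from the pass.

```python
from dataclasses import dataclass

RSRC_BITS = 128    # address space 8 resource part
OFFSET_BITS = 32   # address space 6 offset part


@dataclass(frozen=True)
class SplitFatPointer:
    rsrc: int    # 128-bit buffer resource
    offset: int  # 32-bit offset into the buffer

    def gep(self, byte_delta: int) -> "SplitFatPointer":
        """GEP-like update: only the offset part changes."""
        new_offset = (self.offset + byte_delta) & ((1 << OFFSET_BITS) - 1)
        return SplitFatPointer(self.rsrc, new_offset)

    def to_i160(self) -> int:
        """In-memory form as one 160-bit integer (bit order is an assumption)."""
        return (self.rsrc << OFFSET_BITS) | self.offset

    @staticmethod
    def from_i160(value: int) -> "SplitFatPointer":
        """Inverse of to_i160 under the same assumed bit order."""
        offset = value & ((1 << OFFSET_BITS) - 1)
        rsrc = (value >> OFFSET_BITS) & ((1 << RSRC_BITS) - 1)
        return SplitFatPointer(rsrc, offset)
```

Keeping the two parts separate is what lets a load pass the resource and offset directly to a buffer intrinsic, while the i160 form is only needed when the pointer is stored to or loaded from memory.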
Emma Pilkington
4490003a22
[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
Bjorn Pettersson
da591d390e [GlobalISel][TableGen] Take first result for multi-output instructions (#81130)
Previously, tblgen would reject patterns where one of its nested
instructions produced more than one result. These arise when the
instruction definition contains 'outs' as well as 'Defs'. This patch
fixes that by always taking the first result, which is how these
situations are handled in SelectionDAG.

Original patch: https://reviews.llvm.org/D86617
Continued as: https://github.com/llvm/llvm-project/pull/81130
2024-03-02 20:10:02 +01:00
Petar Avramovic
0d572c41f9
AMDGPU/GlobalISel: remove amdgpu-global-isel-risky-select flag (#83426)
AMDGPUInstructionSelector should no longer attempt to select S1 G_PHIs.
Remove MIR test that attempts to inst-select divergent vcc(S1) G_PHI.
Lane mask merging algorithm for GlobalISel is now responsible for
selecting divergent S1 G_PHIs in AMDGPUGlobalISelDivergenceLowering.
Uniform S1 G_PHIs should be lowered to S32 G_PHIs in reg bank select
pass. In summary S1 G_PHIs should not reach AMDGPUInstructionSelector.
2024-02-29 15:38:54 +01:00
Petar Avramovic
6c2eec5cea
AMDGPU/GlobalISel: lane masks merging (#73337)
Basic implementation of lane mask merging for GlobalISel.
Lane masks on GlobalISel are registers with sgpr register class
and S1 LLT - required by machine uniformity analysis.
Implements equivalent of lowerPhis from SILowerI1Copies.cpp in:
patch 1: https://github.com/llvm/llvm-project/pull/75340
patch 2: https://github.com/llvm/llvm-project/pull/75349
patch 3: https://github.com/llvm/llvm-project/pull/80003
patch 4: https://github.com/llvm/llvm-project/pull/78431
patch 5: is in this commit:

AMDGPU/GlobalISelDivergenceLowering: constrain incoming registers

Previously, in PHIs that represent lane masks, incoming registers
taken as-is were not selected as lane masks. Such registers are not
being merged with another lane mask and most often only have S1 LLT.
Implement constrainAsLaneMask by constraining incoming registers
taken as-is with lane mask attributes, essentially transforming them
to lane masks. This is final step in having PHI instructions created
in this pass to be fully instruction-selected.
2024-02-29 13:57:59 +01:00
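For readers unfamiliar with lane mask merging as referenced in 6c2eec5cea, the core operation can be sketched as a bitwise blend: lanes that are active in the current exec mask take their bit from the new value, and inactive lanes keep the bit from the previous mask. The Python function below is a conceptual sketch under that assumption; the name `merge_lane_masks` and the `wave_size` parameter are hypothetical and do not come from the commit.

```python
def merge_lane_masks(prev: int, cur: int, exec_mask: int, wave_size: int = 64) -> int:
    """Blend two lane masks: active lanes take `cur`, inactive lanes keep `prev`."""
    full = (1 << wave_size) - 1
    inactive = ~exec_mask & full           # lanes not executing right now
    return (prev & inactive) | (cur & exec_mask)
```

When a register is not merged with another mask and is taken as-is, no blend is needed, which is the case the constrainAsLaneMask step in this commit addresses.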
Petar Avramovic
3e35ba53e2
AMDGPU/GFX12: Insert waitcnts before stores with scope_sys (#82996)
Insert waitcnts for loads and atomics before stores with system scope.
Scope is field in instruction encoding and corresponds to desired
coherence level in cache hierarchy.
Intrinsic stores can set scope in cache policy operand.
If volatile keyword is used on generic stores memory legalizer will set
scope to system. Generic stores, by default, get lowest scope level.
Waitcnts are not required if it is guaranteed that memory is cached.
For example vulkan shaders can guarantee this.
TODO: implement flag for frontends to give us a hint not to insert
waits.
Expecting vulkan flag to be implemented as vulkan:private MMRA.
2024-02-28 16:18:04 +01:00
Matt Arsenault
ca66f7469f AMDGPU: Merge tests for llvm.amdgcn.dispatch.id 2024-02-27 18:42:40 +05:30
Matt Arsenault
e7900e695e AMDGPU: Regenerate baseline mir tests 2024-02-27 10:44:53 +05:30
Petar Avramovic
433f8e741e
MachineSSAUpdater: use all vreg attributes instead of reg class only (#78431)
When initializing MachineSSAUpdater save all attributes of current
virtual register and create new virtual registers with same attributes.
New virtual registers now have both the same register class or bank and the same LLT.
Previously new virtual registers had same register class but LLT was not
set (LLT was set to default/empty LLT).
Required by GlobalISel for AMDGPU, new 'lane mask' virtual registers
created by MachineSSAUpdater need to have both register class and LLT.

patch 4 from: https://github.com/llvm/llvm-project/pull/73337
2024-02-26 13:46:13 +01:00
Pierre van Houtryve
4235e44d4c
[GlobalISel] Constant-fold G_PTR_ADD with different type sizes (#81473)
All other opcodes in the list are constrained to have the same type on
both operands, but not G_PTR_ADD.

Fixes  #81464
2024-02-22 13:15:26 +01:00
Nick Anderson
8bd327d6fe
[AMDGPU][GlobalISel] Add fdiv / sqrt to rsq combine (#78673)
Fixes #64743
2024-02-22 09:47:36 +01:00
Nick Anderson
5db49f7266
[GlobalISel] replace right identity X * -1.0 with fneg(x) (#80526)
follow up patch to #78673

@Pierre-vh @jayfoad @arsenm Could you review when you have a chance.
2024-02-21 09:41:59 +00:00
David Green
1b12974ccb
[AArch64][AMDGPU][GlobalISel] Remove vector handling from unmerge_dead_to_trunc (#82224)
This combine transforms an unmerge where only the first element is used
into a truncate. That works OK for scalar but for vector needs to insert
a bitcast to integers, perform the truncate then bitcast back to
vectors. This generates more awkward code than using an Unmerge.
2024-02-20 10:54:44 +00:00
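The scalar case that 1b12974ccb keeps can be checked with a few lines of Python: when a wide scalar is unmerged into equal pieces and only the first (least significant) piece is used, that piece equals a truncate of the source. The helper names `unmerge` and `trunc` are hypothetical and model values as plain integers.

```python
def unmerge(value: int, src_bits: int, piece_bits: int) -> list[int]:
    """Split a scalar into equal-width pieces, least significant piece first."""
    assert src_bits % piece_bits == 0
    mask = (1 << piece_bits) - 1
    return [(value >> (i * piece_bits)) & mask for i in range(src_bits // piece_bits)]


def trunc(value: int, dst_bits: int) -> int:
    """Keep only the low dst_bits bits of a scalar."""
    return value & ((1 << dst_bits) - 1)
```

For vectors the same rewrite would need bitcasts around the truncate, which is why the commit removes the vector handling.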
Pierre van Houtryve
87d7711934
[AMDGPU][SIMemoryLegalizer] Fix order of GL0/1_INV on GFX10/11 (#81450)
Fixes SWDEV-443292
2024-02-13 09:07:51 +01:00
sstipanovic
785eddd7a7
[AMDGPU][GlobalISel] Introduce isRegisterClassType to check for legal types, instead of checking bit width. (#68189)
In D151116 it was suggested to have a set of classes to cover every
possible case. This does it for bitcast first.

closes #79578
2024-02-13 08:26:10 +01:00
Pierre van Houtryve
f93aa5157a
[AMDGPU] Introduce GFX9/10.1/10.3/11 Generic Targets (#76955)
These generic targets include multiple GPUs and will, in the future,
provide a way to build once and run on multiple GPUs, at the cost of fewer
optimization opportunities.

Note that this is just doing the compiler side of things; device libs and
runtimes/loader/etc. don't know about these targets yet, so none of them
actually work in practice right now. This is just the initial commit to
make LLVM aware of them.

This contains the documentation changes for both this change and #76954
as well.
2024-02-12 10:18:20 +01:00
Jan Patrick Lehr
f661057865
Revert "[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init" (#81234)
Reverts llvm/llvm-project#79586

This broke the AMDGPU OpenMP Offload buildbot.
The typical error message was that the GPU attempted to read beyond the
largest legal address.

Error message:
AMDGPU fatal error 1: Received error in queue 0x7f8363f22000:
HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to
access memory beyond the largest legal address.
2024-02-09 09:57:38 +01:00
Diana Picus
bc6955f18c
[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.

[1]
34c8b835b1
2024-02-09 09:20:25 +01:00
alex-t
88e52511ca
[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586)
This change implements synthesizing the private buffer resource
descriptor in the kernel prolog instead of using the preloaded kernel
argument.
2024-02-08 20:27:36 +01:00
Ivan Kosarev
7d19dc50de
[AMDGPU][True16] Support VOP3 source DPP operands. (#80892) 2024-02-08 16:23:00 +00:00
Carl Ritson
9bda1de0b6
[TwoAddressInstruction] Propagate undef flags for partial defs (#79286)
If part of a register (lowered from REG_SEQUENCE) is undefined then we
should propagate undef flags to uses of those lanes. This is only
performed when live intervals are present as it requires live intervals
to correctly match uses to defs, and the primary goal is to allow
precise computation of subrange intervals.
2024-02-07 16:46:00 +09:00
choikwa
e5638c5a00
[AMDGPU] Use correct number of bits needed for div/rem shrinking (#80622)
There was an error where a dividend of type i64 that actually uses 32
bits fell into the path that assumes only 24 bits are used. Check
that the AtLeast field is used correctly when calling computeNumSignBits,
and add the necessary extend/trunc for the 32-bit path.

Regolden and update testcases.

@jrbyrnes @bcahoon @arsenm @rampitec
2024-02-06 21:32:28 +05:30
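The bit-counting behind the fix in e5638c5a00 can be sketched on concrete values. The Python below counts sign bits, derives how many bits a signed division really needs, and picks a path. It is only an illustration: the real analysis works on known bits rather than concrete values, and the helper names and the exact thresholds in `pick_path` are assumptions based on the 24-bit and 32-bit paths named in the commit message.

```python
def num_sign_bits(x: int, bits: int) -> int:
    """Number of leading bits equal to the sign bit, including the sign bit itself."""
    assert -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    magnitude = x.bit_length() if x >= 0 else (~x).bit_length()
    return bits - magnitude


def signed_div_bits(num: int, den: int, bits: int) -> int:
    """Bits needed for a signed div/rem: significant bits plus one sign bit."""
    sign_bits = min(num_sign_bits(num, bits), num_sign_bits(den, bits))
    return bits - sign_bits + 1


def pick_path(div_bits: int) -> str:
    """Hypothetical path selection mirroring the 24-bit and 32-bit shrinking paths."""
    if div_bits <= 24:
        return "24-bit"
    if div_bits <= 32:
        return "32-bit"
    return "full"
```

With these helpers, an i64 dividend of `2**31 - 1` needs 32 bits and must take the 32-bit path, not the 24-bit one, which is the situation the commit describes.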
Matt Arsenault
42b5b720ca AMDGPU/GlobalISel: Fix not running -global-isel in global isel test 2024-02-06 14:55:48 +05:30
Petar Avramovic
06f711a906
AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#80003)
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.

TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.

patch 3 from: https://github.com/llvm/llvm-project/pull/73337
2024-02-05 14:07:01 +01:00
Nikita Popov
00a4e248dc [AMDGPU] Convert tests to opaque pointers (NFC) 2024-02-05 12:42:23 +01:00
Pierre van Houtryve
500846d2f5
[AMDGPU] Introduce Code Object V6 (#76954)
Introduce Code Object V6 in Clang, LLD, Flang and LLVM. This is the same
as V5 except a new "generic version" flag can be present in EFLAGS. This
is related to new generic targets that'll be added in a follow-up patch.
It's also likely V6 will have new changes (possibly new metadata
entries) added later.

Docs change are part of the follow-up patch #76955
2024-02-05 08:19:53 +01:00
Quentin Dian
112fba974c
[MIRPrinter] Don't print line break when there are no instructions (NFC) (#80147)
Per #80143, we can remove the extra line break when there is no
instruction.
2024-02-01 22:10:52 +08:00
Quentin Dian
b7738e275d
[MIRPrinter] Don't print space when there is no successor (#80143)
The extra space causes the checks generated by update_mir_test_checks to
fail.

```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
  ; CHECK-LABEL: name: foo
  ; CHECK: bb.0:
  ; CHECK-NEXT:   successors:
  ; CHECK-NEXT: {{  $}}
  ; CHECK-NEXT: {{  $}}
  ; CHECK-NEXT: bb.1:
  ; CHECK-NEXT:   RET 0, $eax
  bb.0:
    successors:

  bb.1:
    RET 0, $eax
...
```

The failure log is as follows:

```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
 ; CHECK-NEXT: {{ $}}
               ^
<stdin>:21:13: note: 'next' match was here
 successors:
            ^
<stdin>:21:13: note: previous match ended here
 successors:
```
2024-01-31 22:35:41 +08:00
Krzysztof Drewniak
63fe80fb18
[SeparateConstOffsetFromGEP] Handle or disjoint flags (#76997)
This commit extends separate-const-offset-from-gep to look at the
newly-added `disjoint` flag on `or` instructions so as to preserve
additional opportunities for optimization.

The tests were pre-committed in #76972.
2024-01-26 09:56:06 -06:00
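The reason the `disjoint` flag matters in 63fe80fb18 is that an `or` whose operands share no set bits computes the same value as an `add`, so a constant operand can be treated as a constant offset. The Python sketch below checks this on concrete values; in IR the flag is a static guarantee rather than a runtime check, and the helper names `is_disjoint` and `split_const_offset` are hypothetical.

```python
def is_disjoint(a: int, b: int) -> bool:
    """True when a and b have no set bits in common."""
    return (a & b) == 0


def split_const_offset(base: int, const: int):
    """Return (base, const) if `base | const` can be treated as `base + const`."""
    if not is_disjoint(base, const):
        return None   # an overlapping bit means `or` and `add` can differ
    return base, const
```

For example, `0x1000 | 0x4` equals `0x1000 + 0x4`, while `0x1004 | 0x4` does not equal `0x1004 + 0x4`.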
Jay Foad
c5d59fe1b2
[AMDGPU] Disable V_MAD_U64_U32/V_MAD_I64_I32 workaround for GFX11.5 (#79460)
The hardware bug only affects GFX11.0.x.
2024-01-25 16:28:49 +00:00
Mirko Brkušanin
7fdf608cef
[AMDGPU] Add GFX12 WMMA and SWMMAC instructions (#77795)
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Co-authored-by: Piotr Sobczak <piotr.sobczak@amd.com>
2024-01-24 13:43:07 +01:00
Petar Avramovic
c46109d0d7
Revert "AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis" (#79274)
Reverts llvm/llvm-project#78482
2024-01-24 12:18:34 +01:00
Petar Avramovic
91ddcba83a
AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#78482)
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.

TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.

patch 3 from: https://github.com/llvm/llvm-project/pull/73337
2024-01-24 11:58:32 +01:00
Emma Pilkington
4897b9888f
[AMDGPU] Make a few more tests default COV agnostic (#78926) 2024-01-22 11:22:57 -05:00
Pierre van Houtryve
ac296b696c
[AMDGPU] Drop verify from SIMemoryLegalizer tests (#78697)
SIMemoryLegalizer tests were slow, with most of them taking 4.5 to 5.3s
to complete and that's on a fast machine. I also recall seeing them in
the slowest tests list on build bots.

This removes the verify-machineinstrs option from these tests to speed
them up, bringing the slowest test down to around 2s.
Verifier still runs in EXPENSIVE_CHECKS builds.
2024-01-22 10:31:37 +01:00