7132 Commits

Author SHA1 Message Date
Jay Foad
70fc970378
[AMDGPU] Move architected SGPR implementation into isel (#79120) 2024-01-24 15:06:20 +00:00
Nikita Popov
90ba33099c
[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.

This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.

The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.

Fixes https://github.com/llvm/llvm-project/issues/69841.
2024-01-24 15:25:29 +01:00
Mirko Brkušanin
7fdf608cef
[AMDGPU] Add GFX12 WMMA and SWMMAC instructions (#77795)
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Co-authored-by: Piotr Sobczak <piotr.sobczak@amd.com>
2024-01-24 13:43:07 +01:00
Mariusz Sikora
cfddb59be2
[AMDGPU][GFX12] VOP encoding and codegen - add support for v_cvt fp8/… (#78414)
…bf8 instructions

    Add VOP1, VOP1_DPP8, VOP1_DPP16, VOP3, VOP3_DPP8, VOP3_DPP16
    instructions that were supported on GFX940 (MI300):
    - V_CVT_F32_FP8
    - V_CVT_F32_BF8
    - V_CVT_PK_F32_FP8
    - V_CVT_PK_F32_BF8
    - V_CVT_PK_FP8_F32
    - V_CVT_PK_BF8_F32
    - V_CVT_SR_FP8_F32
    - V_CVT_SR_BF8_F32

---------

Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com>
Co-authored-by: Mirko Brkušanin <Mirko.Brkusanin@amd.com>
2024-01-24 12:21:15 +01:00
Petar Avramovic
c46109d0d7
Revert "AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis" (#79274)
Reverts llvm/llvm-project#78482
2024-01-24 12:18:34 +01:00
Petar Avramovic
149ed9d2c5
AMDGPU: update GFX11 wmma hazards (#76143)
One V_NOP or unrelated VALU instruction in between is required for
correctness when matrix A or B of current WMMA instruction overlaps with
matrix D of previous WMMA instruction.
Remaining cases of WMMA operand overlaps are handled by the hardware and
do not require handling in hazard recognizer.

Hardware may stall in cases where:
- matrix C of current WMMA instruction overlaps with matrix D of
previous WMMA instruction
- VALU instruction reads matrix D of previous WMMA instruction
- matrix A,B or C of WMMA instruction reads result of previous VALU
instruction
2024-01-24 12:00:35 +01:00
Petar Avramovic
91ddcba83a
AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#78482)
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.

TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.

patch 3 from: https://github.com/llvm/llvm-project/pull/73337
2024-01-24 11:58:32 +01:00
Christudasan Devadasan
230c13d59d
[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669)
CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.

This patch changes the spilling strategy by picking the VGPRs in the
reverse order, the highest available VGPR first and later after regalloc
shift them back to the lowest available range. With that, the initial
VGPRs would be available for allocation and possibility
of finding large number of contiguous registers will be more.
2024-01-24 07:08:43 +05:30
Changpeng Fang
32073b8356
AMDGPU: Do not generate non-temporal hint when Load_Tr intrinsic did not specify it (#79104)
int_amdgcn_global_load_tr did not specify non-temporal load transpose,
thus we should
not genetrate the non-temporal hint for the load. We need to implement
getTgtMemIntrinsic
to create the corresponding MemSDNode. And we don't set the non-temporal
flag because
the intrinsic did not specify it.

NOTE: We need to implement getTgtMemIntrinsic for any memory intrinsics.
2024-01-23 10:05:32 -08:00
Jay Foad
6cf37dd504
[AMDGPU] Enable architected SGPRs for GFX12 (#79160) 2024-01-23 16:36:30 +00:00
Mirko Brkušanin
6bb7d515c3
[AMDGPU] Properly check op_sel in GCNDPPCombine (#79122) 2024-01-23 17:21:16 +01:00
Pierre van Houtryve
42b0884238
[AMDGPU] Handle V_PERMLANE64_B32 in fixVcmpxPermlaneHazards (#79125)
Fixes #78856
2024-01-23 13:10:58 +01:00
Carl Ritson
4db4d7f282
[AMDGPU] SILowerSGPRSpills: do not update MRI reserve registers (#77888)
VGPRs used for spilling do not require explicit reservation with MRI.
freezeReservedRegs() executed before register allocation ensures these
are placed in the reserve set.

The only pass after SILowerSGPRSpills is SIPreAllocateWWMRegs which
explicitly tests for interference before register allocation so should
not reuse a WWM VGPR holding spill data. reserveReg prevents calculation
of correct liveness for physical registers which could be used to extend
SIPreAllocateWWMRegs.
2024-01-23 10:49:26 +09:00
Emma Pilkington
4897b9888f
[AMDGPU] Make a few more tests default COV agnostic (#78926) 2024-01-22 11:22:57 -05:00
Jeremy Morse
52a8bed426
[DebugInfo][RemoveDIs] Adjust AMDGPU passes to work with DPValues (#78736)
This patch tweaks two AMDGPU passes to use iterators rather than
instruction pointers for expressing an insertion point. This is needed
to accurately support DPValues, the non-instruction storage object for
debug-info.

Two tests were sensitive to this change (variable assignments were being
put in the wrong place), and I've added extra run-lines with the "try
new debug-info..." flag. These get tested on our public buildbot to
ensure they continue to work accurately.
2024-01-22 14:25:08 +00:00
Pierre van Houtryve
ac296b696c
[AMDGPU] Drop verify from SIMemoryLegalizer tests (#78697)
SIMemoryLegalizer tests were slow, with most of them taking 4.5 to 5.3s
to complete and that's on a fast machine. I also recall seeing them in
the slowest tests list on build bots.

This removes the verify-machineinstrs option from these tests to speed
them up, bringing the slowest test down to +-2s.
Verifier still runs in EXPENSIVE_CHECKS builds.
2024-01-22 10:31:37 +01:00
Emma Pilkington
bc82cfb38d
[AMDGPU] Add an asm directive to track code_object_version (#76267)
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.

This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
2024-01-21 11:54:47 -05:00
Jay Foad
63d7ca924f
[AMDGPU] Add GFX12 llvm.amdgcn.s.wait.*cnt intrinsics (#78723) 2024-01-20 11:44:42 +00:00
Jay Foad
89226ecbb9
[AMDGPU] Do not widen scalar loads on GFX12 (#78724)
GFX12 has subword scalar loads so there is no need to do this.
2024-01-19 15:30:07 +00:00
Jay Foad
ed12388082
[AMDGPU] Do not emit V_DOT2C_F32_F16_e32 on GFX12 (#78709)
That instruction is not supported on GFX12.
Added a testcase which previously crashed without this change.

Co-authored-by: pvanhout <pierre.vanhoutryve@amd.com>
2024-01-19 14:36:27 +00:00
Jay Foad
879cbe06ed
[AMDGPU] Fix predicates for BUFFER_ATOMIC_CSUB pattern (#78701)
Use OtherPredicates to avoid interfering with other uses of
SubtargetPredicate for GFX12.
2024-01-19 12:01:31 +00:00
Mirko Brkušanin
0185c76456
[AMDGPU] Fix test for expensive-checks build (#78687) 2024-01-19 11:32:02 +01:00
Leon Clark
2759cfa0c3
[AMDGPU] Remove unnecessary add instructions in ctlz.i8 (#77615)
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.

---------

Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2024-01-19 10:16:46 +00:00
Christudasan Devadasan
4d566e57a2 [AMDGPU] Precommit lit test. 2024-01-19 09:32:03 +05:30
Piotr Sobczak
57f6a3f7ea
[AMDGPU] Add global_load_tr for GFX12 (#77772)
Support new amdgcn_global_load_tr instructions for load with transpose.

* MC layer support for GLOBAL_LOAD_TR_B64/GLOBAL_LOAD_TR_B128
* Intrinsic int_amdgcn_global_load_tr
* Clang builtins amdgcn_global_load_tr*
2024-01-18 15:14:42 +01:00
Jay Foad
745b193260 [AMDGPU] Regenerate tests for #77892 after #77438 2024-01-18 13:50:59 +00:00
Jay Foad
0a3a0ea591
[AMDGPU] Update uses of new VOP2 pseudos for GFX12 (#78155)
New pseudos were added for instructions that were natively VOP3 on
GFX11: V_ADD_F64_pseudo, V_MUL_F64_pseudo, V_MIN_NUM_F64, V_MAX_NUM_F64,
V_LSHLREV_B64_pseudo

---------

Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>
2024-01-18 13:26:13 +00:00
Mariusz Sikora
3e6589f21c
[AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917)
- image_atomic_pk_add_f16
- image_atomic_pk_add_bf16
- ds_pk_add_bf16
- ds_pk_add_f16
- ds_pk_add_rtn_bf16
- ds_pk_add_rtn_f16
- flat_atomic_pk_add_f16
- flat_atomic_pk_add_bf16
- global_atomic_pk_add_f16
- global_atomic_pk_add_bf16
- buffer_atomic_pk_add_f16
- buffer_atomic_pk_add_bf16
2024-01-18 14:01:09 +01:00
Mariusz Sikora
28b7e498b6
AMDGPU/GFX12: Add new dot4 fp8/bf8 instructions (#77892)
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).

---------

Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
2024-01-18 14:00:27 +01:00
Jay Foad
ba52f06f9d
[AMDGPU] CodeGen for GFX12 S_WAIT_* instructions (#77438)
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
2024-01-18 10:47:45 +00:00
Jay Foad
9ca36932b5
[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186) 2024-01-18 10:23:27 +00:00
Jay Foad
c111dc72e9
[AMDGPU] Allow potentially negative flat scratch offsets on GFX12 (#78193)
https://github.com/llvm/llvm-project/pull/70634 has disabled use
of potentially negative scratch offsets, but we can use it on GFX12.

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2024-01-18 10:02:40 +00:00
Ivan Kosarev
2a869ced61
[AMDGPU][True16] Support V_FLOOR_F16. (#78446) 2024-01-18 08:43:47 +00:00
Mirko Brkušanin
1d286ad59b
[AMDGPU] Add mark last scratch load pass (#75512) 2024-01-18 09:36:44 +01:00
Stanislav Mekhanoshin
021def6c22
[AMDGPU] Use alias info to relax waitcounts for LDS DMA (#74537)
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on
this counter to only read memory after it is written. Wait count
insertion pass does not track memory dependencies, it tracks register
dependencies. To model the LDS dependency a pseudo register is used in
the scoreboard, acting like if LDS DMA writes it and LDS load reads it.

This patch adds 8 more pseudo registers to use for independent LDS
locations if we can prove they are disjoint using alias analysis.

Fixes: SWDEV-433427
2024-01-17 23:44:15 -08:00
Matt Arsenault
11bf02e019
DAG: Fix ABI lowering with FP promote in strictfp functions (#74405)
This was emitting non-strict casts in ABI contexts for illegal
types.
2024-01-18 10:57:53 +07:00
Stanislav Mekhanoshin
558ea41159
[AMDGPU] Reapply 'Sign extend simm16 in setreg intrinsic' (#78492)
We currently force users to use a negative contant in the intrinsic
call. Changing it zext would break existing programs, so just sign
extend an argument.
2024-01-17 17:23:46 -08:00
Mariusz Sikora
c99da46fc1
[AMDGPU][GFX12] Add Atomic cond_sub_u32 (#76224)
Co-authored-by: Vang Thao <Vang.Thao@amd.com>
2024-01-17 19:23:42 +01:00
Petar Avramovic
90bdf76fdb
Revert "AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis" (#78468)
Reverts llvm/llvm-project#76145
2024-01-17 17:41:19 +01:00
Jay Foad
e4c8c58517
[AMDGPU] Src1 of VOP3 DPP instructions can be SGPR on GFX12 (#77929) 2024-01-17 15:57:36 +00:00
Matt Arsenault
af4f1766ae
AMDGPU: Allocate special SGPRs before user SGPR arguments (#78234) 2024-01-17 21:41:50 +07:00
Jay Foad
f12059eb3f
[AMDGPU] Fix llvm.amdgcn.s.wait.event.export.ready for GFX12 (#78191)
The meaning of bit 0 of the immediate operand of S_WAIT_EVENT has been
flipped from GFX11.
2024-01-17 11:59:15 +00:00
Jay Foad
e9e9d1b0b1
[AMDGPU] Disable V_MAD_U64_U32/V_MAD_I64_I32 workaround for GFX12 (#77927) 2024-01-17 11:52:19 +00:00
Petar Avramovic
1fbf533286
AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#76145)
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select
them as lane mask phis in same way SILowerI1Copies select VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA and all cases
of uses outside of cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.

TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.

patch 3 from: https://github.com/llvm/llvm-project/pull/73337
2024-01-17 12:10:24 +01:00
Jay Foad
4a77414660
[AMDGPU] CodeGen for GFX12 8/16-bit SMEM loads (#77633) 2024-01-17 10:28:03 +00:00
Jay Foad
42b9ea841e
[AMDGPU] Increase max scratch allocation for GFX12 (#77625) 2024-01-17 10:25:28 +00:00
Jay Foad
36ef291d63
[AMDGPU] Fix hang caused by VS_CNT handling at calls (#78318)
Fix a potential hang introduced by #77439 and #77935. This line:

  setScoreUB(VS_CNT, getScoreLB(VS_CNT) + getWaitCountMax(VS_CNT));

could potentialy set UB lower than it was before, which confused
SIInsertWaitcnts's fixed point algorithm.

This was only triggered a STORE instruction with an implicit-def, which
seems odd but apparently happens for some spills.
2024-01-17 10:24:29 +00:00
Matt Arsenault
53a3c738a9 AMDGPU: Remove fixed fixme from a test 2024-01-17 16:52:50 +07:00
Fangrui Song
9e9907f1cf
[AMDGPU,test] Change llc -march= to -mtriple= (#75982)
Similar to 806761a7629df268c8aed49657aeccffa6bca449.

For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.

Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.

This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:

```
  LLVM :: CodeGen/AMDGPU/fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fabs.ll
  LLVM :: CodeGen/AMDGPU/floor.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
  LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
  LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
2024-01-16 21:54:58 -08:00
Florian Mayer
f3190c78ec
Revert "[AMDGPU] Sign extend simm16 in setreg intrinsic" (#78372)
Reverts llvm/llvm-project#77997

Broke UBSan bots.
2024-01-16 16:37:48 -08:00