5314 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
72c1a0d9c2 [AMDGPU] Allow v_accvgpr_write to use SGPR on gfx90a
This is undocumented, but it should work.

Differential Revision: https://reviews.llvm.org/D122252
2022-03-22 13:52:29 -07:00
Stanislav Mekhanoshin
631a643940 [AMDGPU] Update mfma test to run gfx940 checks. NFC. 2022-03-22 12:43:57 -07:00
alex-t
0a488cba2c [AMDGPU] use scalar shift for SALU users in frame index elimination
In the frame index lowering we have to insert shift and add
instructions to adjust stack object access.  We need to take care of the stack
object user kind and use scalar shift/add for scalar users.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D121524
2022-03-22 11:43:23 +01:00
Carl Ritson
8e64d84995 [MachineSink] Check block prologue interference
Sinking must check for interference between the block prologue
and the instruction being sunk.
Specifically check for clobbering of uses by the prologue, and
overwrites to prologue defined registers by the sunk instruction.

Reviewed By: rampitec, ruiling

Differential Revision: https://reviews.llvm.org/D121277
2022-03-22 11:15:37 +09:00
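As an illustration only, the two interference conditions described in this commit can be modeled with register sets. This is a minimal sketch, not the MachineSink implementation: the function name and the set-based representation of defs and uses are assumptions made for the example.

```python
def prologue_interferes(prologue_defs, instr_defs, instr_uses):
    """Return True if sinking an instruction below a block prologue is unsafe.

    All arguments are sets of register names (a hypothetical, simplified model).
    """
    # Case 1: the prologue clobbers a register that the sunk instruction reads.
    clobbers_use = bool(prologue_defs & instr_uses)
    # Case 2: the sunk instruction overwrites a register the prologue defines.
    overwrites_def = bool(prologue_defs & instr_defs)
    return clobbers_use or overwrites_def


# The prologue defines "exec"; an instruction that reads "exec" must not be sunk past it.
print(prologue_interferes({"exec"}, {"v0"}, {"exec", "v1"}))
# An instruction touching only unrelated registers does not interfere.
print(prologue_interferes({"exec"}, {"v0"}, {"v1"}))
```

A real check would also have to account for register aliasing and sub-registers, which plain name comparison ignores.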
alex-t
a0ea7ec90f [AMDGPU] divergence patterns for the BUILD_VECTOR i16, undef expansion.
BUILD_VECTOR of i16 and undef gets expanded to COPY_TO_REGCLASS.
The latter is further lowered to copy instructions.
We need to provide the correct register class for the uniform and divergent BUILD_VECTOR nodes
to avoid VGPR to SGPR copies.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D122068
2022-03-21 21:11:20 +01:00
Jay Foad
321c8ab81b [AMDGPU] Add an agpr copy propagation test 2022-03-21 11:42:57 +00:00
Jay Foad
692341e998 [AMDGPU] Update checks in agpr-copy-propagation.mir 2022-03-21 11:42:56 +00:00
Thomas Symalla
7de6107dce Revert "[AMDGPU] Improve v_cmpx usage on GFX10.3."
This reverts commit 011c64191ef9ccc6538d52f4b57f98f37d4ea36e and
e725e2afe02e18398525652c9bceda1eb055ea64.

Differential Revision: https://reviews.llvm.org/D122117
2022-03-21 09:50:44 +01:00
Thomas Symalla
011c64191e [AMDGPU] Improve v_cmpx usage on GFX10.3.
On GFX10.3 targets, the following instruction sequence

v_cmp_* SGPR, ...
s_and_saveexec ..., SGPR

leads to a fairly long stall caused by a VALU write to a SGPR and having the
following SALU wait for the SGPR.

An equivalent sequence is to save the exec mask manually instead of letting
s_and_saveexec do the work and use a v_cmpx instruction instead to do the
comparison.

This patch modifies the SIOptimizeExecMasking pass as this is the last position
where s_and_saveexec instructions are inserted. It does the transformation by
trying to find the pattern, extracting the operands and generating the new
instruction sequence.

It also changes some existing lit tests and introduces a few new tests to show
the changed behavior on GFX10.3 targets.

Reviewed By: sebastian-ne, critson

Differential Revision: https://reviews.llvm.org/D119696
2022-03-21 09:31:59 +01:00
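The rewrite described in this commit can be pictured as a peephole over an instruction list. The sketch below is a toy model under stated assumptions: instructions are plain tuples of (opcode, destination, sources...), the opcode spellings are simplified, and none of the legality checks a real pass needs (for example, that the compared SGPR has no other users) are modeled. It is not the SIOptimizeExecMasking code.

```python
def rewrite_cmp_saveexec(instrs):
    """Replace `v_cmp_* sX, ...` + `s_and_saveexec sY, sX` with a manual
    exec save followed by a v_cmpx comparison (toy model)."""
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None
                and cur[0].startswith("v_cmp_")
                and nxt[0] == "s_and_saveexec"
                and nxt[2] == cur[1]):
            saved = nxt[1]
            # Save the exec mask manually instead of relying on s_and_saveexec.
            out.append(("s_mov", saved, "exec"))
            # Do the comparison with v_cmpx, which writes exec directly.
            suffix = cur[0][len("v_cmp_"):]
            out.append(("v_cmpx_" + suffix, "exec") + tuple(cur[2:]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


before = [
    ("v_cmp_lt_i32", "s0", "v1", "v2"),
    ("s_and_saveexec", "s2", "s0"),
    ("v_add", "v3", "v1", "v2"),
]
for instr in rewrite_cmp_saveexec(before):
    print(instr)
```

The point of the rewritten form is that no SALU instruction waits on an SGPR written by the VALU, which is the stall the commit message describes.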
Stanislav Mekhanoshin
d898c9563e [AMDGPU] Add gfx940 run line to gfx90a mfma test. NFC. 2022-03-18 15:23:47 -07:00
Abinav Puthan Purayil
aee3684995 [AMDGPU] Use COPY_TO_REGCLASS for buffer_atomic_cmpswap selection
GlobalISel was selecting the av_* regclass for some cases.

Differential Revision: https://reviews.llvm.org/D121933
2022-03-18 08:56:23 +05:30
Stanislav Mekhanoshin
275b0c5a5a [AMDGPU] Add 2 gfx940 mfma tests. NFC. 2022-03-17 15:47:13 -07:00
Changpeng Fang
dd5895cc39 AMDGPU: Use the implicit kernargs for code object version 5
Summary:
  Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.

Reviewers: arsenm, sameerds, yaxunl

Differential Revision: https://reviews.llvm.org/D120265
2022-03-17 14:12:36 -07:00
Stanislav Mekhanoshin
522b259976 [AMDGPU] Allow v_accvgpr_write to use SGPR src on gfx940
Differential Revision: https://reviews.llvm.org/D121843
2022-03-17 12:12:06 -07:00
Vang Thao
27e1931508 [AMDGPU] Fix PreRARematerialize scheduler pass sinking subreg defs
When collecting trivially rematerializable defs, skip any subreg defs. We do not want to sink these.

Differential Revision: https://reviews.llvm.org/D121874
2022-03-17 11:38:53 -07:00
Matt Arsenault
8d66603a48 Revert "RegAllocGreedy: Fix last chance recolor assert in impossible case"
This reverts commit c46aab01c002b7a04135b8b7f1f52d8c9ae23a58.

This evidently blocks compiling in some cases that used to work
before. I'm also not fully convinced this is the correct place to fix
this problem.
2022-03-17 13:12:01 -04:00
Abinav Puthan Purayil
f59cb41ba1 [AMDGPU] Select buffer_atomic_cmpswap* in tblgen
This change replaces the manual selection of buffer_atomic_cmpswap*
instructions in SelectionDAG and GlobalISel with a tblgen based
selection in BUFInstructions.td. This allows us to select the return and
no-return variants in tblgen.

Differential Revision: https://reviews.llvm.org/D121770
2022-03-17 10:12:32 +05:30
Christudasan Devadasan
6dd21d1db1 [AMDGPU][SIFoldOperands] Consider the alignment constraints
Enforced an alignment check while folding the operands.
2022-03-17 08:27:53 +05:30
Christudasan Devadasan
af717d4aca [AMDGPU][MachineVerifier] Alignment check for fp32 packed math instructions
The fp32 packed math instructions are introduced in gfx90a.
If their vector register operands are not properly aligned, the
verifier should flag them. Currently, the verifier fails to
report it and the compiler ends up emitting broken assembly.
This patch fixes that missed case in TII::verifyInstruction.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D121794
2022-03-17 08:21:35 +05:30
Florian Hahn
e5822ded56 [FunctionAttrs] Infer argmemonly.
This patch adds initial argmemonly inference, by checking the underlying
objects of locations returned by MemoryLocation.

I think this should cover most cases, except function calls to other
argmemonly functions.

I'm not sure if there's a reason why we don't infer those yet.

Additional argmemonly can improve codegen in some cases. It also makes
it easier to come up with a C reproducer for 7662d1687b09 (already fixed,
but I'm trying to see if C/C++ fuzzing could help to uncover similar
issues.)

Compile-time impact:
NewPM-O3: +0.01%
NewPM-ReleaseThinLTO: +0.03%
NewPM-ReleaseLTO+g: +0.05%

https://llvm-compile-time-tracker.com/compare.php?from=067c035012fc061ad6378458774ac2df117283c6&to=fe209d4aab5b593bd62d18c0876732ddcca1614d&stat=instructions

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D121415
2022-03-16 10:24:33 +00:00
Nikita Popov
57d57b1afd [AAEval] Make compatible with opaque pointers
With opaque pointers, we cannot use the pointer element type to
determine the LocationSize for the AA query. Instead, -aa-eval
tests are now required to have an explicit load or store for any
pointer they want to compute alias results for, and the load/store
types are used to determine the location size.

This may affect ordering of results, and sorting within one result,
as the type is not considered part of the sorted string anymore.

To somewhat minimize the churn, printing still uses faux typed
pointer notation.
2022-03-16 10:02:11 +01:00
Joe Nash
687d20de7f [AMDGPU] Regen checks again for no-remat-indirect-mov
NFC. Update script does not behave right since the run lines have
identical output. Delete the duplicated check prefix added in
22cfbf7ecacdf7db47c2f65fe896bdf62ebcc0f3
2022-03-15 13:44:41 -04:00
Joe Nash
4cf86bd744 [AMDGPU] Regen checks for schedule-barrier
NFC. Hasn't been updated since script added check-next
2022-03-15 13:35:43 -04:00
Joe Nash
22cfbf7eca [AMDGPU] Regen checks for no-remat-indirect-mov
NFC. Hasn't been updated since the update script started adding
check-next.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D121719
2022-03-15 13:33:42 -04:00
Stanislav Mekhanoshin
c4500de255 [AMDGPU] gfx940: disable OP_SEL on V_DOT instructions
Differential Revision: https://reviews.llvm.org/D121634
2022-03-14 17:02:00 -07:00
Stanislav Mekhanoshin
1f53f20fc1 [AMDGPU] Support gfx940 v_lshl_add_u64 instruction
Differential Revision: https://reviews.llvm.org/D121401
2022-03-14 15:45:42 -07:00
Stanislav Mekhanoshin
36fe3f13a9 [AMDGPU] flat scratch SVS addressing mode for gfx940
Both VADDR and SADDR are used in SVS mode.

Differential Revision: https://reviews.llvm.org/D121254
2022-03-14 15:23:36 -07:00
Stanislav Mekhanoshin
47bac63d3f [AMDGPU] gfx940 memory model
Differential Revision: https://reviews.llvm.org/D121242
2022-03-14 15:01:46 -07:00
Stanislav Mekhanoshin
72a9e5f891 [AMDGPU] Restrict machine copy propagation from creating unaligned classes
Fixes: SWDEV-326366

Differential Revision: https://reviews.llvm.org/D121491
2022-03-14 14:09:40 -07:00
Austin Kerbow
62bcfcb5a5 [AMDGPU] Add llvm.amdgcn.s.setprio intrinsic
Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D120976
2022-03-12 22:15:42 -08:00
Stanislav Mekhanoshin
31f215ab0c [AMDGPU] Support v_mov_b64 in dpp combine
Differential Revision: https://reviews.llvm.org/D121411
2022-03-11 11:37:32 -08:00
Nikita Popov
3ed643ea76 [AMDGPUPromoteAlloca] Make compatible with opaque pointers
This mainly changes the handling of bitcasts to not check the types
being casted from/to -- we should only care about the actual
load/store types. The GEP handling is also changed to not care about
types, and just make sure that we get an offset corresponding to
a vector element.

This was a bit of a struggle for me, because this code seems to be
pretty sensitive to small changes. The end result seems to produce
strictly better results for the existing test coverage though,
because we can now deal with more situations involving bitcasts.

Differential Revision: https://reviews.llvm.org/D121371
2022-03-11 09:20:51 +01:00
Stanislav Mekhanoshin
c7f25b6fd4 [AMDGPU] Updated some tests to run on gfx940. NFC. 2022-03-10 12:34:24 -08:00
alex-t
d159b4444c [AMDGPU] Enable divergence predicates for negative inline constant subtraction
We have a pattern that undoes the sub x, c -> add x, -c canonicalization, since c is more likely
to be an inline immediate than -c. This patch enables it to select scalar or vector subtraction based on the input node divergence.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D121360
2022-03-10 15:03:22 +03:00
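To make the reasoning in this commit concrete, here is a small sketch of the selection choice. It assumes that integer inline constants cover -16 through 64, so that 64 is inline while -64 is not; the function names and the "s_"/"v_" opcode spellings are hypothetical and stand in for the real scalar and vector instructions.

```python
def is_inline_int(value):
    # Assumed integer inline-constant range for this sketch: -16..64.
    return -16 <= value <= 64


def select(imm, divergent):
    """Choose an opcode and immediate for the canonical form `add x, imm`."""
    # Divergent nodes get a vector (VALU) opcode, uniform nodes a scalar one.
    prefix = "v_" if divergent else "s_"
    # Undo the canonicalization when only the negated constant is inline.
    if not is_inline_int(imm) and is_inline_int(-imm):
        return (prefix + "sub", -imm)
    return (prefix + "add", imm)


print(select(-64, divergent=False))
print(select(-64, divergent=True))
print(select(-16, divergent=False))
```

With these assumptions, `add x, -64` is selected as a subtraction of 64 so the constant can be encoded inline, while `add x, -16` is left alone because -16 is already inline.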
Nikita Popov
eb4037ff42 [AMDGPU] Fix regenerated test checks (NFC)
I used the wrong build to generate the checks, sorry :(
2022-03-10 11:56:17 +01:00
Nikita Popov
611da6b582 [AMDGPU] Regenerate test checks (NFC) 2022-03-10 11:53:45 +01:00
Nikita Popov
eaac3484ab [AMDGPU] Regenerate test checks (NFC)
Also rename variables to avoid file check clash.
2022-03-10 11:32:45 +01:00
Carl Ritson
3cb9af1be2 [MachineSink] Pre-commit test for D121277. NFC. 2022-03-10 11:33:06 +09:00
Xiang1 Zhang
c31014322c TLS loads optimization (hoist)
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120000
2022-03-10 09:29:06 +08:00
Stanislav Mekhanoshin
0be6fd44f3 [SDAG] Use MMO flags in MemSDNode folding
SDNodes with different target flags may now be folded together
rightfully resulting in the assertion in the refineAlignment.
Folding nodes with different target flags may result in the
wrong load instructions produced at least on the AMDGPU.

Fixes: SWDEV-326805

Differential Revision: https://reviews.llvm.org/D121335
2022-03-09 14:25:22 -08:00
Changpeng Fang
0f20a35b9e AMDGPU: Set up User SGPRs for queue_ptr only when necessary
Summary:
  In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In the current implementation,
user SGPRs are set up unnecessarily for some cases. If the target has aperture
registers, queue_ptr is not needed to reference aperture bases. For trap
handling, if the target supports getDoorbellID, queue_ptr is also not necessary.
Further, code object version 5 introduces a new kernel ABI which passes queue_ptr
as an implicit kernel argument, so user SGPRs are no longer necessary for
queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, so we remove queue_ptr support for llvm.debugtrap
in the backend.

Reviewers: sameerds, arsenm

Fixes: SWDEV-307189

Differential Revision: https://reviews.llvm.org/D119762
2022-03-09 10:14:05 -08:00
Stanislav Mekhanoshin
33fb23f728 [AMDGPU] Merge flat with global in the SILoadStoreOptimizer
Flat can be merged with flat global since address cast is a no-op.
A combined memory operation needs to be promoted to flat.

Differential Revision: https://reviews.llvm.org/D120431
2022-03-09 10:04:37 -08:00
Vang Thao
28322c2514 [AMDGPU] Add scheduler pass to rematerialize trivial defs
Add a new pass in the pre-RA AMDGPU scheduler to check whether sinking trivially rematerializable defs that have only one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum number of defs required to increase occupancy. Also re-schedule all regions whose occupancy matched the previous minimum occupancy, using the new occupancy.

This is based off of the discussion in https://reviews.llvm.org/D117562.

The logic to determine the defs we should collect and to determine whether sinking would be beneficial is mostly the same. The main differences are that we no longer limit it to immediate defs, and that the def and use do not have to be part of a loop.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D119475
2022-03-09 09:34:33 -08:00
Jay Foad
28f67aed9d [AMDGPU] Fix some confusing check prefixes. NFC.
Tahiti is SI/GFX6.
Kaveri and Hawaii are CI/GFX7.
Fiji is VI/GFX8.
2022-03-09 17:05:49 +00:00
Venkata Ramanaiah Nalamothu
04fff547e2 [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
because doing so requires understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

Doing so poses an issue: the return instruction now uses the undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm, ronlieb

Differential Revision: https://reviews.llvm.org/D114652
2022-03-09 12:18:02 +05:30
Arthur Eubanks
b81d5baa0f [test] Use new PM for -aa-eval tests 2022-03-08 14:15:53 -08:00
Stanislav Mekhanoshin
9eabea3968 [AMDGPU] Set noclobber metadata on loads instead of cast to constant
A load via pointer cast to constant will return true from
pointsToConstantMemory which is not necessarily so.

Fixes: SWDEV-326463

Differential Revision: https://reviews.llvm.org/D121172
2022-03-07 23:13:02 -08:00
Christudasan Devadasan
0d849b8249 AMDGPU: Skip folding REG_SEQUENCE if found unknown regclasses for its users
Use TII::getRegClass to return a valid regclass or a nullptr
if the RC is unknown for a given OpIdx. This fixes a potential
crash that occurred while getting the RC from a variadic instruction.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D120813
2022-03-08 10:11:57 +05:30
Stanislav Mekhanoshin
932f628121 [AMDGPU] new gfx940 fp atomics
Differential Revision: https://reviews.llvm.org/D121028
2022-03-07 12:32:02 -08:00
Stanislav Mekhanoshin
e7b362d75d [AMDGPU] Add v_mov_b64 gfx940 opcode
Differential Revision: https://reviews.llvm.org/D121023
2022-03-07 12:07:12 -08:00