6171 Commits

Author SHA1 Message Date
Amara Emerson
b309bc04ee [GlobalISel] Combine out-of-range shifts to undef.
Differential Revision: https://reviews.llvm.org/D144303
2023-02-17 15:05:00 -08:00
Nick Desaulniers
a3a84c9e25 [llvm] add CallBrPrepare pass to pipelines
Capstone of
https://discourse.llvm.org/t/rfc-syncing-asm-goto-with-outputs-with-gcc/65453/8

Clang changes are still necessary to enable the use of outputs along
indirect edges of asm goto statements.

Link: https://github.com/llvm/llvm-project/issues/53562

Reviewed By: void

Differential Revision: https://reviews.llvm.org/D140180
2023-02-16 17:58:34 -08:00
Jay Foad
8a17cd9905 AMDGPU: Add a regression test case for D143963 2023-02-16 17:11:32 +00:00
Jay Foad
8e5a41e827 Revert "AMDGPU: Override getNegatedExpression constant handling"
This reverts commit 11c3cead23783e65fb30e673d62771352078ff05.

It was causing infinite loops in the DAG combiner.
2023-02-16 17:11:32 +00:00
Jay Foad
9305b63d69 [AMDGPU] Add another G_UNMERGE_VALUES legalization test case 2023-02-16 16:45:35 +00:00
Florian Hahn
2ac85cd563
[AMDGPU] Regenerate check lines to enable updating for D144050. 2023-02-16 16:38:15 +00:00
Diana Picus
819dfc338b [AMDGPU] Autogenerate checks for several tests. NFCI 2023-02-16 10:54:34 +01:00
Matt Arsenault
11c3cead23 AMDGPU: Override getNegatedExpression constant handling
Ignore the multiple use heuristics of the default
implementation, and report cost based on inline immediates. This
is mostly interesting for -0 vs. 0. Gets a few small improvements.
fneg_fadd_0_f16 is a small regression. We could probably avoid this
if we handled folding fneg into div_fixup.
2023-02-15 05:21:00 -04:00
Stanislav Mekhanoshin
12b4f9e2af [AMDGPU] Do not apply schedule metric for regions with spilling
D139710 has added a metric to increase schedule's ILP while
staying within the same occupancy. Do not bother to apply this
metric to a region which is known to have spilling, it may result
in spilling to reappear after the previous stage and will do no
good if we already spilling anyway. It may also reduce compile
time a bit for such regions.

Fixes: SWDEV-377300

Differential Revision: https://reviews.llvm.org/D143934
2023-02-14 12:16:46 -08:00
Matt Arsenault
09dd4d870e DAG: Remove hasBitPreservingFPLogic
This doesn't make sense as an option. fneg and fabs are bit
preserving by definition. If a target has some fneg or fabs
instruction that are not bitpreserving it's incorrect to lower
fneg/fabs to use it.
2023-02-14 10:25:24 -04:00
Matt Arsenault
f3c008ca77 DAG: Relax foldBitcastedFPLogic conditions
Requiring a bitcast to exist was unhelpful. The most basic cases
are always going to be a CopyFromReg or load, so they would need
a new cast inserted. Don't require a bitcast if it's a free
operation. I don't think this logic makes particularly much sense
(it seems to be imparting special interpretation of bitcast), but
this needs to be in sync with foldSignChangeInBitcast.

We should also get rid of this hasBitPreservingFPLogic hook. fabs/fneg
are bitpreserving or incorrectly implemented, so this should just be a
regular legality check.
2023-02-14 07:59:10 -04:00
Matt Arsenault
4f0eb57222 AMDGPU: Teach getNegatedExpression about rcp 2023-02-14 04:02:39 -04:00
Matt Arsenault
ce4b719f33 AMDGPU: Add test for getNegatedExpression with rcp 2023-02-14 04:02:39 -04:00
Matt Arsenault
0a669bd894 AMDGPU: Add additional tests for combiner infinite loop 2023-02-14 04:02:38 -04:00
pvanhout
04f6934589 [DAG] Handle build_vector with all undefs in reduceBuildVecTruncToBitCast
While working on D143731 I hit a case where a build_vector with 2 undef operands could be generated (with one undef hidden behind a bitcast).
That made `reduceBuildVecTruncToBitCast` crash because it seems to assume there is at least one good operand.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D143886
2023-02-14 08:52:28 +01:00
Arthur Eubanks
7c6b46e87e Revert "[DAGCombiner] handle more store value forwarding"
This reverts commit f35a09daebd0a90daa536432e62a2476f708150d.

Causes miscompiles, see D138899
2023-02-13 19:07:28 -08:00
Christudasan Devadasan
1c9e6238fe [AMDGPU] Allow architected SGPRs for workgroup IDs
Some subtargets use architected SGPRs for workgroup
IDs instead of the regular SGPRs. This patch enables
the support for the same and is guarded under the
subtarget feature FeatureArchitectedSGPRs.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D143707
2023-02-13 22:11:35 +05:30
Changpeng Fang
7ca3444fba AMDGPU: Use module flag to get code object version at IR level folow-up
Summary:
  This is part of the leftover work for https://reviews.llvm.org/D143138.
In this work, we pass code object version as an argument to initialize target ID
and use it for targetID dump.

Reviewers: arsenm

Differential Revision
  https://reviews.llvm.org/D143293
2023-02-10 11:16:38 -08:00
Pierre van Houtryve
70924673af [RFC][GISel] Add a way to ignore COPY instructions in InstructionSelector
RFC to add a way to ignore COPY instructions when pattern-matching MIR in GISel.
    - Add a new "GISelFlags" class to TableGen. Both `Pattern`  and `PatFrags` defs can use it to alter matching behaviour.
    - Flags start at zero and are scoped: the setter returns a `SaveAndRestore` object so that when the current scope ends, the flags are restored to their previous values. This allows child patterns to modify the flags without affecting the parent pattern.
    - Child patterns always reuse the parent's pattern, but they can override its values. For more examples, see `GlobalISelEmitterFlags.td` tests.
    - [AMDGPU] Use the IgnoreCopies flag in BFI patterns, which are known to be bothered by cross-regbank copies.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D136234
2023-02-10 08:37:42 +01:00
Pierre van Houtryve
d9a6fc82f5 [AMDGPU] Run unmerge combines post regbankselect
RegBankSelect can insert G_UNMERGE_VALUES in a lot of places which
left us with a lot of unmerge/merge pairs that could be simplified.
These often got in the way of pattern matching and made codegen
worse.

This patch:
  - Makes the necessary changes to the merge/unmerge combines so they can run post RegBankSelect
  - Adds relevant unmerge combines to the list of RegBankSelect combines for AMDGPU
  - Updates some tablegen patterns that were missing explicit cross-regbank copies (V_BFI patterns were causing constant bus violations with this change).

This seems to be mostly beneficial for code quality.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D142192
2023-02-10 08:34:23 +01:00
Andrew Savonichev
c65b4d64d4 [SelectionDAG] Do not second-guess alignment for alloca
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.

The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.

Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).

Differential Revision: https://reviews.llvm.org/D135462
2023-02-09 18:45:20 +03:00
Mirko Brkusanin
43924cbd29 [AMDGPU][GlobalISel] Fix selection of image sample g16 instructions
Pre-GFX10 A16 modifier would imply G16. From GFX10 and onwards there are
separate instructions for 16bit gradients. This fixes the condition for
selecting G16 opcodes. Also stop adding G16 flag to instructions that do not
use gradients for GFX10 onwards.
2023-02-09 16:26:55 +01:00
Stanislav Mekhanoshin
94def1b44e [AMDGPU] Do not exapnd fp atomics on gfx940
FP atomics are safe on gfx940. This fixes regression after D131560.

Fixes: SWDEV-380468

Differential Revision: https://reviews.llvm.org/D143603
2023-02-08 13:22:04 -08:00
Stanislav Mekhanoshin
3e9f2af27a [AMDGPU] Update atomic tests. NFC.
This is to precommit tests before future patch.
2023-02-08 12:55:34 -08:00
Jay Foad
805da0e298 [AMDGPU] Fix some LABEL check lines 2023-02-06 15:53:13 +00:00
Jay Foad
785cc4d71a [AMDGPU] Fix DOS line endings in some tests 2023-02-06 15:53:13 +00:00
Ruiling Song
be3f4591af AMDGPU: Mark control flow intrinsics non-duplicable
This is used to help get simplified CFG for divergent regions as well as
get better code generation in some cases.

For example, with below IR:
```
define amdgpu_kernel void @test() {
bb:
  br label %bb1

bb1:
  %tmp = phi i32 [ 0, %bb ], [ %tmp5, %bb4 ]
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cnd = icmp eq i32 %tid, 0
  br i1 %cnd, label %bb4, label %bb2

bb2:
  %tmp3 = add nsw i32 %tmp, 1
  br label %bb4

bb4:
  %tmp5 = phi i32 [ %tmp3, %bb2 ], [ %tmp, %bb1 ]
  store volatile i32 %tmp5, ptr addrspace(1) undef
  br label %bb1
}
```

We got below assembly before the change:
```
  v_mov_b32_e32 v1, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  buffer_store_dword v1, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc
  s_xor_b64 s[0:1], exec, s[0:1]
                                        ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 killed $exec
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v1, s[0:1], 1, v1
  s_branch .LBB0_1
```

After the change:
```
  s_mov_b32 s0, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  v_mov_b32_e32 v0, s0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  buffer_store_dword v0, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb1
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v0, s[0:1], 1, v0
  s_branch .LBB0_1
```

We are using one less VGPR, one less s_xor_, and better LICM with one
additional branch after the change. Please note the experiment
was done with reverting the workaround D139780, as it will stop the
tail-duplication completely for this case.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D118250
2023-02-06 15:32:44 +08:00
Stanislav Mekhanoshin
dd0caa82de [AMDGPU] Fix liveness in the SIOptimizeExecMaskingPreRA.cpp
If a condition register def happens past the newly created use
we do not properly update LIS. It has two problems:

1) We do not extend defining segment to the end of its block
   marking it a live-out (this is regression after
   https://reviews.llvm.org/rG09d38dd7704a52e8ad2d5f8f39aaeccf107f4c56)

2) We do not extend use segment to the beginning of the use block
   marking it a live-in.

Fixes: SWDEV-379563

Differential Revision: https://reviews.llvm.org/D143302
2023-02-05 12:21:28 -08:00
Matt Arsenault
6ce86a7eff AMDGPU: Ensure flat loads are broken into dword in functions
We were assuming we could rely on the flat scratch init detection
to imply if there are possible flat addressed stack objects, which
doesn't work outside of a kernel. We should have a way to prove
if a given flat access can't access the stack.

We could use a not-stack parameter attribute to avoid
these splits.

Make the minimally correct change for GlobalISel; I'll address
this better in my larger patch to rewrite load and store legalization.

Fixes: SWDEV-218237
2023-02-05 05:25:15 -04:00
Matt Arsenault
d53699cc45 AMDGPU: Add some regression tests that infinite looped combiner
Prevent a future patch from introducing an infinite combine loop.
2023-02-04 08:21:18 -04:00
Matt Arsenault
a7fad92ba8 AMDGPU: Add more tests to fneg modifier with casting tests 2023-02-03 07:28:47 -04:00
Changpeng Fang
54cf69c9d5 AMDGPU: Use module flag to get code object version at IR level
Summary:
  This patch introduces a mechanism to check the code object version from the module flag, This avoids checking from command line.
In case the module flag is missing, we use the current default code object version supported in the compiler.

For tools whose inputs are not IR, we may need other approach (directive, for example) to check the code
object version, That will be in a separate patch later.

For LIT tests update, we directly add module flag if there is only a single code object version associated with all checks in one file.
In cause of multiple code object version in one file, we use the "sed" method to "clone" the checks to achieve the goal.

Reviewer: arsenm

Differential Revision:
  https://reviews.llvm.org/D14313
2023-02-02 18:57:26 -08:00
Matt Arsenault
0f8b3b97fd AMDGPU: Add additional tests for is.fpclass legalization 2023-02-02 22:50:23 -04:00
Matt Arsenault
3a01b4a93d AMDGPU: Regenerate test checks
Use right prefix order to get merging.

Also drop -verify-machineinstrs and add -amdgpu-enable-delay-alu=0
2023-02-02 22:50:23 -04:00
Matt Arsenault
36cfe26a52 AMDGPU: Try to unfold fneg source when matching legacy fmin/fmax
This is NFC as it stands, since other combines will effectively
prevent this from being reachable. This will avoid regressions in a
future change which tries to make better use of select source
modifiers.

Didn't bother with the GlobalISel part for now, since the baseline
combine doesn't seem to work on the existing test.
2023-02-02 22:50:23 -04:00
Chen Zheng
f35a09daeb [DAGCombiner] handle more store value forwarding
When lowering calls on target like PPC, some stack loads
will be generated for by value parameters. Node CALLSEQ_START
prevents such loads from being combined.

Suggested by @RolandF, this patch removes the unnecessary
loads for the byval parameter by extending ForwardStoreValueToDirectLoad

Reviewed By: nemanjai, RolandF

Differential Revision: https://reviews.llvm.org/D138899
2023-02-01 21:06:17 -05:00
Matt Arsenault
e9c49901a4 AMDGPU/GlobalISel: Add stub custom regbankselect pass
Uniformity analysis needs to be the fundamental basis for
regbank decisions. The considerations of the default pass
are secondary, but potentially useful for some edge cases (e.g.
selecting AGPRs when arbitrary loads and stores can directly use
them). This needs to be a separate pass since it requires new
analysis dependencies.

Boilerplate to subclass the existing pass which does nothing
different.
2023-01-30 16:18:20 -04:00
Matt Arsenault
68d4656722 AMDGPU: Don't insert pointer bitcasts for printf lowering
Cleanup leftover typed pointer handling.
2023-01-28 21:49:10 -04:00
Matt Arsenault
9a22aeb91d Attributes: Check declarations for dereferenceable bytes
This will allow tablegen to start directly marking intrinsics as
dereferenceable in a useful way. Not sure if callsites should override
or use the max.
2023-01-28 20:40:23 -04:00
Matt Arsenault
d4299b3825 AMDGPU: Convert fcopysign tests to generated checks and add cases 2023-01-28 07:57:28 -04:00
Matt Arsenault
8e6406c2ce AMDGPU: Add fneg and select test 2023-01-28 07:57:28 -04:00
Matt Arsenault
93ec3fa402 AMDGPU: Support atomicrmw uinc_wrap/udec_wrap
For now keep the exising intrinsics working.
2023-01-27 22:17:16 -04:00
Matt Arsenault
d416f7e7f1 AMDGPU: Add fneg from integer tests 2023-01-25 22:38:53 -04:00
Matt Arsenault
881194d47c AMDGPU: Remove redundant test
implicit-arg-v5-opt.ll already covers this with more cases.
2023-01-25 20:14:04 -04:00
Matt Arsenault
5a4a8eb2b6 AMDGPU: Convert some tests to opaque pointers 2023-01-25 14:33:20 -04:00
Austin Kerbow
913837eaa3 [ScheduleDAG] Fix removing edges with weak deps
In SUnit::removePred edges are removed from the Preds and Succs lists before
updating the bookkeeping. This could result in incorrect values for
NumPreds/SuccsLeft and cause WeakPreds/SuccsLeft to underflow, since the
incorrect SDep will be used to update these values.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D142325
2023-01-25 10:05:50 -08:00
Matt Arsenault
778cf5431c IR: Add atomicrmw uinc_wrap and udec_wrap
These are essentially add/sub 1 with a clamping value.

AMDGPU has instructions for these. CUDA/HIP expose these as
atomicInc/atomicDec. Currently we use target intrinsics for these,
but those do no carry the ordering and syncscope. Add these to
atomicrmw so we can carry these and benefit from the regular
legalization processes.
2023-01-24 17:55:11 -04:00
Stanislav Mekhanoshin
2968380717 [AMDGPU] Add missing gfx11 tests in the directive-amdgcn-target.ll. NFC. 2023-01-24 09:58:02 -08:00
Yashwant Singh
2a832d0f09 [AMDGPU] Add missing physical register check in SIFoldOperands::tryFoldLoad
tryFoldLoad() is not meant to work on physical registers moreover
use_nodbg_instructions(reg) makes the compiler buggy when called with
physical reg

Fix for SWDEV-373493

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D141895
2023-01-24 14:24:41 +05:30
Nicolai Hähnle
10cef708a7 AMDGPU: Clean up LDS-related occupancy calculations
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.

getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.

At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).

This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.

Differential Revision: https://reviews.llvm.org/D139468
2023-01-23 21:43:06 +01:00