1698 Commits

Author SHA1 Message Date
Petar Avramovic
cbc378ecb8 GlobalISel: Artifact combine merge-like and unmerges into merge-like
Recognize when sub-vectors have been split to elements which are used to
build a large vector.
This happens when instructions have different vector sizes available.
For example, a few arithmetic instructions are required to process all
elements of a larger vector that can be stored using one instruction.

Differential Revision: https://reviews.llvm.org/D109242
2022-10-24 13:33:06 +02:00
Petar Avramovic
e6c778f861 GlobalISel: Artifact combine merge-like and unmerge into unmerge
Recognize when source could have been unmerged to pieces with DstTy
without having to split source to smaller elements
and then merge small elements into DstTy pieces.
This happens when a vector was meant to be split to sub-vectors but there
was a leftover. At this point the artifact combiner has already dealt with
the leftover and we can continue to use sub-vectors.

Differential Revision: https://reviews.llvm.org/D109241
2022-10-24 13:33:05 +02:00
Petar Avramovic
f1aa598046 GlobalISel: Artifact combine merge-like and unmerge into copy
Recognize copy that is represented as split of a source register to
elements that were reassembled to another register with the same type.
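
A minimal sketch of why this pattern is a copy, modelling registers as plain integers. The helper names `unmerge` and `merge` are hypothetical stand-ins for the unmerge and merge-like artifacts, not LLVM APIs, and the piece order (least significant first) is an assumption of the model:

```python
def unmerge(value, total_bits, piece_bits):
    """Split value into equal-sized pieces, least significant piece first."""
    mask = (1 << piece_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, piece_bits)]


def merge(pieces, piece_bits):
    """Reassemble pieces (least significant first) into a single value."""
    result = 0
    for index, piece in enumerate(pieces):
        result |= piece << (index * piece_bits)
    return result


source = 0x1122334455667788
pieces = unmerge(source, 64, 16)   # split a 64-bit value into four 16-bit pieces
rebuilt = merge(pieces, 16)        # rebuild a value of the same width
```

Because `rebuilt` always equals `source` when the widths match, the split and reassembly pair carries no information beyond a plain copy.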

Differential Revision: https://reviews.llvm.org/D109240
2022-10-24 13:33:05 +02:00
Petar Avramovic
51b98db487 GlobalISel: Precommit for artifact combine patches
Differential Revision: https://reviews.llvm.org/D117655
2022-10-24 13:33:05 +02:00
Pierre van Houtryve
ed5fe7f3a1 [AMDGPU][GISel] Re-enable some working tests
These tests had been commented out but seem to not be crashing.
Not sure if codegen is perfect in each of them, but even if it's not I think it's better to put a TODO to fix codegen than remove the test outright, unless codegen is plain wrong (then I'd still rather XFAIL rather than hide it)

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D136341
2022-10-21 06:39:40 +00:00
Peter Rong
c2e7c9cb33 [CodeGen] Using ZExt for extractelement indices.
In https://github.com/llvm/llvm-project/issues/57452, we found that IRTranslator is translating `i1 true` into `i32 -1`.
This is because IRTranslator uses SExt for indices.

In this fix, we change the expected behavior of extractelement's index, moving from SExt to ZExt.
This change covers documentation, SelectionDAG and IRTranslator.
We also included a test for AMDGPU and updated tests for AArch64, Mips, PowerPC, RISCV, VE, WebAssembly and X86.

This patch fixes issue #57452.
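
The difference between the two extensions can be sketched on plain integers. This is an illustrative model only; `sext`, `zext` and `as_signed` are hypothetical helpers, not LLVM functions:

```python
def sext(value, from_bits, to_bits):
    """Sign-extend the low from_bits of value to to_bits (result as unsigned)."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    signed = (value ^ sign_bit) - sign_bit
    return signed & ((1 << to_bits) - 1)


def zext(value, from_bits, to_bits):
    """Zero-extend the low from_bits of value; the upper bits stay zero."""
    return value & ((1 << from_bits) - 1)


def as_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


i1_true = 1
sext_index = as_signed(sext(i1_true, 1, 32), 32)   # the old behaviour
zext_index = as_signed(zext(i1_true, 1, 32), 32)   # the new behaviour
```

In this model the sign-extended `i1 true` becomes -1 while the zero-extended one stays 1, which matches the mistranslation described in the issue.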

Differential Revision: https://reviews.llvm.org/D132978
2022-10-15 15:45:35 -07:00
Jessica Paquette
0f1a51e173 [GlobalISel] Allow vectors in redundant or + add combines
We support KnownBits for vectors, so we can enable these.

https://godbolt.org/z/r9a9W4Gj1

Differential Revision: https://reviews.llvm.org/D135719
2022-10-11 15:31:09 -07:00
Pierre van Houtryve
4d815bfae0 [GISel] Add redundant bitcast folding combine
Simply folds away bitcasts that cancel each other.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D135146
2022-10-11 15:03:08 +00:00
Weining Lu
42b70793a1 Reland "[Clang][LoongArch] Add inline asm support for constraints k/m/ZB/ZC"
Reference: https://gcc.gnu.org/onlinedocs/gccint/Machine-Constraints.html

k: A memory operand whose address is formed by a base register and
(optionally scaled) index register.

m: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as st.w and ld.w.

ZB: An address that is held in a general-purpose register. The offset
is zero.

ZC: A memory operand whose address is formed by a base register and
offset that is suitable for use in instructions with the same
addressing mode as ll.w and sc.w.

Note:
The INLINEASM SDNode flags in the tests below are updated because the newly
introduced enum `Constraint_k` is added before `Constraint_m`.
  llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-inline-asm.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
  llvm/test/CodeGen/X86/callbr-asm-kill.mir

This patch passes `ninja check-all` on a X86 machine with all official
targets and the LoongArch target enabled.

Differential Revision: https://reviews.llvm.org/D134638
2022-10-11 19:51:48 +08:00
Pierre van Houtryve
36c3833783 [GISel] Add Trunc/Lshr/BuildVector Folding
Similar to the current "Trunc/BuildVector" folding - which folds low element extracts of BuildVectors, folds hi element extracts done using bitshifts.
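
A rough model of the idea, assuming a two-element vector of 16-bit lanes that is viewed as a 32-bit scalar with element 0 in the low bits. The lane order and all helper names here are assumptions for illustration, not the combiner's actual code:

```python
ELEMENT_BITS = 16
MASK = (1 << ELEMENT_BITS) - 1


def build_vector_as_scalar(lo, hi):
    """Model a 2 x 16-bit build vector viewed as one 32-bit scalar."""
    return ((hi & MASK) << ELEMENT_BITS) | (lo & MASK)


def trunc16(value):
    """Model truncation to 16 bits."""
    return value & MASK


def extract_lo(lo, hi):
    """Existing fold: truncating the scalar yields the low element."""
    return trunc16(build_vector_as_scalar(lo, hi))


def extract_hi_via_shift(lo, hi):
    """New fold: shifting right by the element size, then truncating, yields the high element."""
    return trunc16(build_vector_as_scalar(lo, hi) >> ELEMENT_BITS)
```

In both cases the result is simply one of the build vector's operands, so the shift and truncate can be replaced by that operand.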

For D134354

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D135148
2022-10-07 08:44:03 +00:00
Pierre van Houtryve
bb71079e30 [AMDGPU][GISel] Add missing V2S16 BUILD_VECTOR_TRUNC legalization
Previously we would be unable to legalize V2S16 BUILD_VECTOR_TRUNC on GFX8 & below as the custom legalization was missing.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D135149
2022-10-06 06:48:53 +00:00
Carl Ritson
c316332e17 [Sink] Allow sinking of invariant loads across critical edges
Invariant loads can always be sunk.

Reviewed By: foad, arsenm

Differential Revision: https://reviews.llvm.org/D135133
2022-10-06 09:21:12 +09:00
Pierre van Houtryve
c93104073c [AMDGPU] Always lower SHUFFLE_VECTOR
Make it illegal, remove InstructionSelector logic for it

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134967
2022-10-04 14:23:17 +00:00
jeff
f4e6149d82 [AMDGPU] Use V_PERM to match buildvectors when inputs are not canonicalized (i.e. can't use V_PACK)
If we can not prove that f16 operands of a buildvector are canonicalized, then we can not lower into a V_PACK. In this scenario, we would previously lower into some combination of and(sdwa), shr, or. This patch allows for matching into V_PERM instead.

Change-Id: Ifa4a74fdb81ef44f22ba490c7fdf81ec8aebc945
2022-10-03 12:58:29 -07:00
Pierre van Houtryve
d8258508d4 [AMDGPU][GISel] Update isCanonicalized
Recognize more opcodes in the function.
Fixes some regressions introduced in D134857 for fdiv.f16 too.

Depends on D134857

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D134862
2022-09-30 14:13:35 +00:00
Pierre van Houtryve
7388520d1c [GISel] Add more cases to isKnownNeverNaN
Make it even with the DAG implementation as of D134854

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D134857
2022-09-30 14:10:56 +00:00
Pierre van Houtryve
653beae5a1 [AMDGPU][GISel] Add Identity BUILD_VECTOR Combines
Folds away BUILD_VECTOR-related no-ops in the post-legalizer combiner.

Depends on D134433

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134953
2022-09-30 14:07:13 +00:00
Pierre van Houtryve
9a67a6b72a [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR
Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.

This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134433
2022-09-30 14:04:53 +00:00
Jessica Paquette
1eb49bbab6 [GlobalISel][CallLowering] Use hasRetAttr for return flags on CallBases
Given something like this:

```
declare signext i16 @signext_callee()
define i32 @caller() {
  %res = call i16 @signext_callee()
  ...
}
```

CallLowering would miss that signext_callee's return value is sign extended,
because it isn't on the call.

Use hasRetAttr on the CallBase to allow us to catch this.

(This now inserts G_ASSERT_SEXT/G_ASSERT_ZEXT like in the original review.)

Differential Revision: https://reviews.llvm.org/D86228
2022-09-28 19:38:24 -07:00
Jay Foad
ddfa0f62d8 [AMDGPU] Add GFX11 feature for subtargets with more VGPRs
The full complement of physical VGPRs for GFX11 is 50% more than GFX10.
Some subtargets have this, others stay the same as GFX10. This affects
occupancy calculations.

Differential Revision: https://reviews.llvm.org/D134522
2022-09-23 20:18:23 +01:00
Petar Avramovic
6db7921b65 AMDGPU: Use tablegen patterns for buffer global and flat atomic fadd
Remove manual selection for atomic fadd from global-isel.
Stop pre-isel translation to AtomicLoadFAdd/G_ATOMICRMW_FADD
which corresponds to llvm-ir's atomicrmw fadd instruction.

global and flat atomic fadd patterns changes:
Split rtn/no-rtn patterns
Add missing patterns or fix predicates
Remove atomicrmw patterns for v2f16 (atomic rmw doesn't support vectors).
Patterns now check addrspace of pointer, added patterns for flat intrinsic
with global addrspace pointer that selects into global atomic instruction.

buffer atomic fadd patterns changes:
Edit patterns to import into global-isel.
Remove gfx6/gfx7 _addr64 and _offset patterns.
Remove patterns that can't be reached (same pattern but different feature).

Differential Revision: https://reviews.llvm.org/D130579
2022-09-23 17:52:10 +02:00
Petar Avramovic
5cee9047d5 AMDGPU: Improve atomicrmw fadd selection
Use same atomicrmw fadd expansion rules for gfx908, gfx940 and gfx11
as for gfx90a. Add missing globalisel legalizer support for flat
atomicrmw fadd f32 on gfx940 and gfx11.
Isel support for gfx11 will be added in D130579.

Differential Revision: https://reviews.llvm.org/D131560
2022-09-23 17:52:10 +02:00
Petar Avramovic
48968c47b0 AMDGPU: Add detailed buffer, global and flat atomic fadd tests
Precommit for D130579 that will remove manual selection and use
patterns from td files. Tests are grouped based on target features.

All patterns have rtn and no-rtn versions.

buffer atomics patterns are selected based on the intrinsic used
(raw or struct) and the offset operand (imm or vgpr):
_offset raw with imm offset
_offen raw with vgpr offset (or large imm offset)
_idxen struct with imm offset
_bothen struct with vgpr offset (or large imm offset)
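
The four-way mapping above can be written as a small lookup. This is only a sketch of the table in this message; `buffer_atomic_suffix` and its parameters are hypothetical names, and a large immediate offset is treated as needing a vgpr offset, as the list states:

```python
def buffer_atomic_suffix(intrinsic_kind, needs_vgpr_offset):
    """Return the pattern suffix for a buffer atomic.

    intrinsic_kind: "raw" or "struct".
    needs_vgpr_offset: True for a vgpr offset or a large immediate offset.
    """
    table = {
        ("raw", False): "_offset",
        ("raw", True): "_offen",
        ("struct", False): "_idxen",
        ("struct", True): "_bothen",
    }
    return table[(intrinsic_kind, needs_vgpr_offset)]
```

For example, `buffer_atomic_suffix("struct", True)` gives `_bothen`.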

global and flat atomics are selected via intrinsic or the atomicrmw fadd.
atomicrmw tests have amdgpu-unsafe-fp-atomics=true and non-system scope
since they get expanded otherwise. atomicrmw fadd does not support vector
type, test float and double.

global atomics patterns are selected based on address type via (global or
flat) intrinsic or atomicrmw fadd with global address(addrspace(1)*).
'no suffix' vgpr addrspace(1)* address
_saddr sgpr addrspace(1)* address

flat atomics patterns are selected via (flat)intrinsic or atomicrmw fadd
with flat address (* - address space 0).

Differential Revision: https://reviews.llvm.org/D131561
2022-09-23 17:52:10 +02:00
Joe Nash
b982ba2a6e [AMDGPU][GFX11] Use VGPR_32_Lo128 for VOP1,2,C
Due to the encoding changes in GFX11, we had a hack in place that
    disables the use of VGPRs above 128. This patch removes the need for
    that hack.

    We introduce a new register class VGPR_32_Lo128 which is used for 16-bit
    operands of VOP1, VOP2, and VOPC instructions. This register class only has the
    low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1,
    VOP2, and VOPC instructions are correctly limited to use the first 128
    VGPRs, while the other instructions can freely use all 256.

    We introduce new pseudo-instructions used on GFX11 which have the suffix
    t16 (True 16) to use the VGPR_32_Lo128 register class.

Reviewed By: foad, rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D133723
2022-09-20 09:56:28 -04:00
Matt Arsenault
69153d6c0a AMDGPU: Use GlobalPriority for largest register tuples
Only do this for 16 and 32 register tuples, although we might want to
extend to 8 tuples.

It's incredibly expensive to spill these, and doing so majorly
interferes with the ability to allocate anything else in the function.

The lit tests show mostly sizeable improvements with a handful of tiny
regressions with large vectors.
2022-09-15 11:45:02 -04:00
Jay Foad
3743f9afeb [AMDGPU] Add GFX11 globalisel test coverage for fptosi/fptoui
2022-09-13 10:51:02 +01:00
Jay Foad
210e6a993d [GlobalISel] Simplify extended add/sub to add/sub with carry
Simplify extended add/sub (with carry-in and carry-out) to add/sub with
carry (with carry-out only) if carry-in is known to be zero.
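
A small model of why the simplification is valid, using 32-bit unsigned arithmetic. `uadde` and `uaddo` are hypothetical Python stand-ins for the extended add and the add with carry-out, not the GlobalISel opcodes themselves:

```python
BITS = 32
MASK = (1 << BITS) - 1


def uadde(a, b, carry_in):
    """Extended add: returns (sum, carry_out) with a carry-in."""
    total = a + b + carry_in
    return total & MASK, total >> BITS


def uaddo(a, b):
    """Add with carry-out only: returns (sum, carry_out)."""
    total = a + b
    return total & MASK, total >> BITS


samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
# With a carry-in of zero, both forms give the same sum and carry-out.
equivalent = all(uadde(a, b, 0) == uaddo(a, b) for a in samples for b in samples)
```

The same reasoning applies to the subtract forms with a zero borrow-in.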

Differential Revision: https://reviews.llvm.org/D133702
2022-09-12 17:05:44 +01:00
Joe Nash
8604904e68 [AMDGPU] Separate check lines for some GFX11 16-bit codegen tests
NFC. Pre-commits test changes to have a separate CHECK line where GFX11 behavior will diverge from
previous subtargets in a future patch.
2022-09-12 09:38:34 -04:00
Matt Arsenault
bb70b5d406 CodeGen: Set MODereferenceable from isDereferenceableAndAlignedPointer
Previously this was assuming pointsToConstantMemory implies
dereferenceable.
2022-09-12 08:38:35 -04:00
Jay Foad
8901f7cebc [AMDGPU] Fix crash legalizing G_EXTRACT_VECTOR_ELT with negative index
Fixes https://github.com/llvm/llvm-project/issues/57408

Differential Revision: https://reviews.llvm.org/D132938
2022-09-09 15:53:34 +01:00
Justin Bogner
a81c7dbf0d [AMDGPU] Drop _oneuse checks from med3 patterns
We use _oneuse checks to make sure combines won't accidentally
increase code size, but this prevents the optimization in cases where
we happen to want to clamp multiple values to the same range

It's safe to drop these checks for two reasons:

1. The pattern of max/min operations for med3 is complicated enough
   it's unlikely to come up by accident, so this will still only fire
   when appropriate to do so
2. Even if every intermediate is used and we don't save a single
   operation, we still won't end up with more operations since the
   med3 replaces the final max/min.

In pathological cases we could potentially end up with a larger
encoding size or possibly slightly increased vgpr pressure, but the
risk of that is low, especially considering the upside.
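
As an illustration of the pattern being matched, clamping a value to a range with max followed by min picks the median of the three inputs. This sketch uses integers only (floating-point NaN handling is out of scope here) and `med3` is a hypothetical model of the instruction:

```python
def med3(a, b, c):
    """Return the median of three values."""
    return sorted((a, b, c))[1]


def clamp(x, lo, hi):
    """Clamp x to [lo, hi] using max followed by min."""
    return min(max(x, lo), hi)


lo, hi = -4, 7
values = range(-10, 11)
# Several values clamped to the same range: lo and hi have more than one use.
clamped = [clamp(x, lo, hi) for x in values]
medians = [med3(x, lo, hi) for x in values]
```

The two lists agree whenever `lo <= hi`, and each clamp still maps to one med3 even though `lo` and `hi` are shared.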

Differential Revision: https://reviews.llvm.org/D132621
2022-09-07 16:31:49 -07:00
Justin Bogner
f9433161f5 [AMDGPU] Precommit two tests showing missed combines to v_med3
2022-08-30 11:56:09 -07:00
Pierre van Houtryve
59cf9dd923 [AMDGPU][GISel] Enable Selection of ADD3 for G_PTR_ADD
Allows things like `(G_PTR_ADD (G_PTR_ADD a, b), c)` to be
simplified into a single ADD3 instruction instead of two adds.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D131254
2022-08-24 14:44:19 +00:00
Luo, Yuanke
5159be3c9b (Reland) [fastalloc] Support allocating specific register class in fastalloc
This reverts commit 853bb192c407f5d9e75a5fd55cc089151530cbd3.
2022-08-20 13:25:34 +08:00
Luo, Yuanke
853bb192c4 Revert "(Reland) [fastalloc] Support allocating specific register class in fastalloc"
This reverts commit 30f9e6ebd30b79d13f99eaca4d829e0da07186b3.
2022-08-15 20:33:15 +08:00
Luo, Yuanke
30f9e6ebd3 (Reland) [fastalloc] Support allocating specific register class in fastalloc
Reland commit 719658d078c4

The base RA infrastructure supports restricting an RA pass to
allocating only a specific register class. Since greedy RA and basic RA
are derived from base RA, they both allow allocating a specific register
class. Fast RA doesn't support allocating registers for a specific
register class. This patch enables ShouldAllocateClass in fast RA, so
that it can support allocating registers for a specific register class.

Differential Revision: https://reviews.llvm.org/D131825
2022-08-13 13:57:34 +08:00
Yaxun (Sam) Liu
e780648a15 [AMDGPU] Unify unreachable intrinsics
si-annotate-control-flow does depth first traversal of BB's of
a function to insert amdgcn if intrinsics for conditional
branches so that isel can generate correct instructions later.

si-annotate-control-flow checks whether the successor BB for the 'else'
branch of a conditional branch has been visited. If it has been
visited, si-annotate-control-flow assumes the conditional
branch has been handled and will not try to insert if intrinsic
for it.

This assumption is not correct when the IR contains multiple
unreachable BB's. Then 'if' intrinsics are not inserted and incorrect
ISA is generated.

This patch fixes the issue by letting amdgpu-unify-divergent-exit-nodes
unify unreachables even if they are uniformly reached. In this way
the IR will not contain multiple exits, and the structurizer is able to
structurize the IR containing one unified exit.

Reviewed by: Ruiling Song, Matt Arsenault

Differential Revision: https://reviews.llvm.org/D131181

Fixes: SWDEV-343244
2022-08-09 10:23:32 -04:00
Carl Ritson
4c4db81630 [AMDGPU] Extend SILoadStoreOptimizer to s_load instructions
Apply merging to s_load as is done for s_buffer_load.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D130742
2022-07-30 11:38:39 +09:00
Austin Kerbow
ba0d079c7a [AMDGPU] Aggressively schedule to reduce RP in occupancy limited regions
By not clustering loads and adjusting heuristics to more aggressively reduce
register pressure we may be able to increase occupancy for the function if it
was dropped in a first pass scheduling.

Similarly, try to reduce spilling if register usage exceeds lower bound
occupancy.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D130329
2022-07-27 22:34:37 -07:00
Jay Foad
716ca2e3ef [AMDGPU] Pre-sink IR input for some tests
Edit the IR input for some codegen tests to simulate what the IR code
sinking pass would do to it. This makes the tests immune to the presence
or absence of the code sinking pass in the codegen pass pipeline, which
does not belong there.

Differential Revision: https://reviews.llvm.org/D130169
2022-07-21 14:25:44 +01:00
Thomas Symalla
fd64a857ee [AMDGPU] Combine s_or_saveexec, s_xor instructions.
This patch merges a consecutive sequence of

s_or_saveexec s_o, s_i
s_xor exec, exec, s_o

into a single

s_andn2_saveexec s_o, s_i instruction.
This patch also cleans up the SIOptimizeExecMasking pass a bit.
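
A bit-level sketch of why the two sequences agree, assuming the usual semantics (save the old EXEC to the destination, then update EXEC). The function names are hypothetical models, not the pass's code:

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def or_saveexec_then_xor(exec_mask, src):
    """Model s_or_saveexec s_o, s_i followed by s_xor exec, exec, s_o."""
    saved = exec_mask                        # s_o = EXEC
    exec_mask = (src | exec_mask) & MASK     # EXEC = s_i | EXEC
    exec_mask = (exec_mask ^ saved) & MASK   # EXEC = EXEC ^ s_o
    return saved, exec_mask


def andn2_saveexec(exec_mask, src):
    """Model s_andn2_saveexec s_o, s_i."""
    saved = exec_mask                        # s_o = EXEC
    exec_mask = src & ~exec_mask & MASK      # EXEC = s_i & ~EXEC
    return saved, exec_mask


# Exhaustive check over all 4-bit mask pairs: (s | e) ^ e == s & ~e.
same = all(
    or_saveexec_then_xor(e, s) == andn2_saveexec(e, s)
    for e in range(16)
    for s in range(16)
)
```

The check rests on the identity `(s | e) ^ e == s & ~e`, which holds bit by bit.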

Reviewed By: nhaehnle

Differential Revision: https://reviews.llvm.org/D129073
2022-07-21 14:16:37 +02:00
Jay Foad
9383b09858 [AMDGPU][GlobalISel] Fix subtarget checks for combining to v_med3_i16
Differential Revision: https://reviews.llvm.org/D130243
2022-07-21 11:41:31 +01:00
Jon Chesterfield
3a20597776 [amdgpu] Implement lds kernel id intrinsic
Implement an intrinsic for use in lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.

There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.

It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.

The intent is to emit a __const array of LDS addresses and index into it.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D125060
2022-07-19 17:46:19 +01:00
Matt Arsenault
8d0383eb69 CodeGen: Remove AliasAnalysis from regalloc
This was stored in LiveIntervals, but not actually used for anything
related to LiveIntervals. It was only used in one check for if a load
instruction is rematerializable. I also don't think this was entirely
correct, since it was implicitly assuming constant loads are also
dereferenceable.

Remove this and rely only on the invariant+dereferenceable flags in
the memory operand. Set the flag based on the AA query upfront. This
should have the same net benefit, but has the possible disadvantage of
making this AA query nonlazy.

Preserve the behavior of assuming pointsToConstantMemory implying
dereferenceable for now, but maybe this should be changed.
2022-07-18 17:23:41 -04:00
Ivan Kosarev
432cbd7827 [AMDGPU][CodeGen] Support (register + immediate) SMRD offsets.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D129381
2022-07-18 11:29:31 +01:00
Matt Arsenault
e9a45d45d0 GlobalISel: Allow forming atomic/volatile G_SEXTLOAD
Mirror the change to G_ZEXTLOAD.
2022-07-08 11:55:08 -04:00
Matt Arsenault
1ee6ce9bad GlobalISel: Allow forming atomic/volatile G_ZEXTLOAD
SelectionDAG has a target hook, getExtendForAtomicOps, which it uses
in the computeKnownBits implementation for ATOMIC_LOAD. This is pretty
ugly (as is having a separate load opcode for atomics), so instead
allow making use of atomic zextload. Enable this for AArch64 since the
DAG path defaults to the zext behavior.

The tablegen changes are pretty ugly, but partially helps migrate
SelectionDAG from using ISD::ATOMIC_LOAD to regular ISD::LOAD with
atomic memory operands. For now the DAG emitter will emit matchers for
patterns which the DAG will not produce.

I'm still a bit confused by the intent of the isLoad/isStore/isAtomic
bits. The DAG implementation rejects trying to use any of these in
combination. For now I've opted to make the isLoad checks also check
isAtomic, although I think having isLoad and isAtomic set on these
makes most sense.
2022-07-08 11:55:08 -04:00
Jay Foad
8fc8bf59f2 [AMDGPU] Add GFX11 test coverage sharing checks with GFX10
2022-07-08 11:56:49 +01:00
Jay Foad
de3b5d7316 [AMDGPU] More GFX11 coverage for tests with generated checks
2022-07-08 11:06:02 +01:00
Jay Foad
a59c3eb2f3 [AMDGPU] Add GFX11 coverage to shared sdag/gisel tests
2022-07-08 09:40:20 +01:00