6693 Commits

Author SHA1 Message Date
Pravin Jagtap
c931f2e6fd [AMDGPU] Autogenerate & pre-commit tests for D156301 and D157388
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D157712
2023-08-18 09:50:44 -04:00
Carl Ritson
ad9eed1e77 [MachineVerifier] Verify LiveIntervals for PHIs
Implement basic support for verifying LiveIntervals for PHIs.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156872
2023-08-18 18:14:22 +09:00
Joe Nash
6aab000874 [AMDGPU] Convert fmul-2-combine-multi-use test to auto-gen
NFC. Deletes the unused SI runline.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D158198
2023-08-17 14:23:20 -04:00
Pravin Jagtap
af5fd142d3 [AMDGPU] Extend f32 support for llvm.amdgcn.update.dpp intrinsic
This will be useful to avoid the bit-casting noise
required to extend support for Floating Point
Operations in atomic optimizer for DPP in D156301

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D156647
2023-08-17 10:45:19 -04:00
Christudasan Devadasan
81827f8cfb [AMDGPU] Support wwm-reg AV spill pseudos
The wwm register spill pseudos are currently defined for VGPR_32
regclass. It causes a verifier error for gfx908 or above as the
regalloc sometimes restores the values to the vector superclass AV_32.
Fixing it by supporting AV wwm-spill pseudos as well.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D155646
2023-08-17 20:04:18 +05:30
Tuan Chuong Goh
74e191da07 [AArch64][GlobalISel] Combiner for EXT
Keep components of UNMERGE larger after running the Artifact Combiner on it.
This was intended to help with <v16i64> = G_SEXT <v16i16>, but implementation
for legalizing EXT is in a following patch, therefore a test for this case
will be included in the following patch

Differential Revision: https://reviews.llvm.org/D157715
2023-08-17 14:34:45 +01:00
Matt Arsenault
c9d0d15e69 AMDGPU: Refine some rsq formation tests
Drop unnecessary flags and metadata, add contract flags that should be
necessary.
2023-08-16 13:37:03 -04:00
Matt Arsenault
66ee794064 AMDGPU: Fix verifier error on splatted opencl fmin/fmax and ldexp calls
Apparently the spec has overloads for fmin/fmax and ldexp with one of
the operands as scalar. We need to broadcast the scalars to the vector
type.

https://reviews.llvm.org/D158077
2023-08-16 09:42:26 -04:00
Ivan Kosarev
d7efe41598 [AMDGPU] Autogenerate the v_cndmask.ll and llvm.amdgcn.image.msaa.load.ll codegen tests.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D157970
2023-08-16 12:50:51 +01:00
Ivan Kosarev
f9ab235318 [AMDGPU] Autogenerate the fmuladd.f16.ll and llvm.fmuladd.f16.ll codegen tests.
Reviewed By: Joe_Nash

Differential Revision: https://reviews.llvm.org/D157966
2023-08-16 12:49:45 +01:00
Carl Ritson
d0e246ff16 [LiveRange] Fix inaccurate verification of live-in PhysRegs
Fix verification that a PhysReg is live in to an MBB.
isLiveIn does not handle reg units, so cannot identify when a
register would be defined because its super register is partially
defined.
Additionally a PhysReg may be partial defined at block entry and
then fully defined before any use.

Reviewed By: foad, arsenm

Differential Revision: https://reviews.llvm.org/D157086
2023-08-16 17:42:42 +09:00
Joe Nash
a093032981 [AMDGPU][True16] Update FPToI1Pat GFX11 pat to use GFX11 instruction
These cmp patterns were using the pre-GFX11 pseudo instruction, and so
failed to compile.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D157912
2023-08-15 11:11:17 -04:00
Matt Arsenault
d251761660 AMDGPU: Replace log libcalls with log intrinsics 2023-08-15 10:48:46 -04:00
Matt Arsenault
81b278e613 AMDGPU: Fix fast f32 exp2
Mirror of the previous log changes, OpenCL conformance doesn't like
interpreting afn as ignore denormal handling but was previously hidden
by flag dropping.
2023-08-15 10:48:46 -04:00
Matt Arsenault
4b7b4b9458 AMDGPU: Fix fast f32 log/log10
OpenCL conformance didn't like interpreting afn as ignore the denormal
handling.

https://reviews.llvm.org/D157940
2023-08-15 10:48:46 -04:00
Matt Arsenault
e09b3593ba AMDGPU: Fix fast math log2 f32
Apparently afn doesn't allow you to drop the denormal handling
according to OpenCL conformance. This was hidden by losing the flags
during the library linking process. Fast log is still broken and needs
more work.

https://reviews.llvm.org/D157936
2023-08-15 10:48:46 -04:00
Jay Foad
f0e5f73fdc [MachineScheduler] Account for lane masks in basic block liveins
Differential Revision: https://reviews.llvm.org/D157633
2023-08-15 09:52:43 +01:00
Matt Arsenault
1faa4797ca AMDGPU: Handle unsafe exp.f32 with denormal handling
I somehow missed this path when adding the new expansions. Saves a lot
of instructions for afn + IEEE.

https://reviews.llvm.org/D157867
2023-08-14 18:36:01 -04:00
Matt Arsenault
d45022b094 AMDGPU: Remove special case constant folding of divide
We should probably just swap this out for the fdiv, but that's what
the implementation is anyway.
2023-08-14 18:36:01 -04:00
Matt Arsenault
0eabe65bfb AMDGPU: Replace ldexp libcalls with intrinsic 2023-08-14 18:36:01 -04:00
Matt Arsenault
f337a77c99 AMDGPU: Replace rounding libcalls with intrinsics 2023-08-14 18:36:01 -04:00
Matt Arsenault
c7876c55ac AMDGPU: Replace fabs and copysign libcalls with intrinsics
Preserves flags and metadata like the other cases.
2023-08-14 18:28:21 -04:00
Matt Arsenault
a70006c4c5 AMDGPU: Replace some libcalls with intrinsics
OpenCL loses fast math information by going through libcall wrappers
around intrinsics.

Do this to preserve call site flags which are lost when inlining. It's
not safe in general to propagate flags during inline, so avoid dealing
with this by just special casing some of the useful calls.
2023-08-14 18:20:47 -04:00
Joe Nash
dc242f9f1e [AMDGPU][NFC] Convert fpto{u|s}i f16 tests to auto-gen
Makes it easier to add GFX11 runline in future patch, which has
significantly different output.
2023-08-14 15:28:20 -04:00
Matt Arsenault
a8376bbe53 AMDGPU: Add baseline tests for libcall to intrinsic handling
Test all the different itanium mangled opencl functions that are
interesting to replace with raw intrinsic calls.

https://reviews.llvm.org/D157873
2023-08-14 15:15:30 -04:00
Matt Arsenault
f44beecb78 AMDGPU: Try to use private version of sincos if available
The comment was out of date, the device libs build does provide all
the pointer overloads. An extremely pedantic interpretation of the
spec would suggest only the flat version exists, but the overloads do
exist in the implementation.

https://reviews.llvm.org/D156720
2023-08-14 11:40:04 -04:00
Matt Arsenault
58fd1de09f AMDGPU: Consider nobuiltin when querying defined libfuncs
https://reviews.llvm.org/D156708
2023-08-14 11:30:12 -04:00
Matt Arsenault
42c6e4209c AMDGPU: Handle multiple uses when matching sincos
Match how the generic implementation handles this. We now will leave
behind the dead other user for later passes to deal with.

https://reviews.llvm.org/D156707
2023-08-14 11:28:41 -04:00
Nikita Popov
9deee6bffa [SDAG] Don't transfer !range metadata without !noundef to SDAG (PR64589)
D141386 changed the semantics of !range metadata to return poison
on violation. If !range is combined with !noundef, violation is
immediate UB instead, matching the old semantics.

In theory, these IR semantics should also carry over into SDAG.
In practice, DAGCombine has at least one key transform that is
invalid in the presence of poison, namely the conversion of logical
and/or to bitwise and/or (c7b537bf09/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (L11252)).
Ideally, we would fix this transform, but this will require
substantial work to avoid codegen regressions.

In the meantime, avoid transferring !range metadata without
!noundef, effectively restoring the old !range metadata semantics
on the SDAG layer.

Fixes https://github.com/llvm/llvm-project/issues/64589.

Differential Revision: https://reviews.llvm.org/D157685
2023-08-14 09:04:27 +02:00
Christudasan Devadasan
bd7c6e3c48 [AMDGPU] Precommit lit test for wwm-reg AV spill pseudos D155646. 2023-08-12 16:18:18 +05:30
Konrad Kusiak
4fa8a5487e [AMDGPU] Add sanity check that fixes bad shift operation in AMD backend
There is a problem with the
SILoadStoreOptimizer::dmasksCanBeCombined() function that can lead to
UB.

This boolean function decides if two masks can be combined into 1. The
idea here is that the bits which are "on" in one mask, don't overlap
with the "on" bits of the other. Consider an example (10 bits for
simplicity):

Mask 1: 0101101000
Mask 2: 0000000110

Those can be combined into a single mask: 0101101110.

To check if such an operation is possible, the code takes the mask
which is greater and counts how many 0s there are, starting from the
LSB and stopping at the first 1. Then, it shifts 1u by this number and
compares it with the smaller mask. The problem is that when both masks
are 0, the counter will find 32 zeroes in the first mask and will try
to do a shift by 32 positions which leads to UB.

The fix is a simple sanity check, if the bigger mask is 0 or not.

https://reviews.llvm.org/D155051
2023-08-11 15:26:35 -04:00
Mirko Brkusanin
1e5359c6ba [AMDGPU] Treat KIMM32 and KIMM16 operand types as noninlinable
While they are represent 32/16 bit immediate values they are already
included in encoding of the instructions that use them and are not true
literals. FMAMK and FMAAK instructions that use them are marked with fixed
size so getInstSizeInBytes will not increase the size for these operands.

We also add tests whose logic relies on KIMM16 and KIMM32 being considered
not inlinable.

Differential Revision: https://reviews.llvm.org/D157624
2023-08-11 18:46:39 +02:00
Jeffrey Byrnes
f76ffc1f40 [MCP] Invalidate copy for super register in copy source
We must also track the super sources of a copy, otherwise we introduce a sort of subtle bug.

Consider:

1.  DEF r0:r1
2.  USE r1
3.  r6:r9 = COPY r10:r13
4.  r14:15 = COPY r0:r1
5.  USE r6
6.. r1:4 = COPY r6:9

BackwardCopyPropagateBlock processes the instructions from bottom up. After processing 6., we will have propagatable copy for r1-r4 and r6-r9. After 5., we invalidate and erase the propagatble copy for r1-r4 and r6 but not for r7-r9.

The issue is that when processing 3., data structures still say we have valid copies for dest regs r7-r9 (from 6.). The corresponding defs for these registers in 6. are r1:r4, which we mark as registers to invalidate. When invalidating, we find the copy that corresponds to r1 is 4. (this was added when processing 4.), and we say that r1 now maps to unpropagatable copies. Thus, when we process 2., we do not have a valid copy, but when we process 1. we do -- because the mapped copy for subregister r0 was never invalidated.

The net result is to propagate the copy from 4. to 1., and replace DEF r0:r1 with DEF r14:r15. Then, we have a use before def in 2.

The main issue is that we have an inconsitent state between which def regs and which src regs are valid. When processing 5., we mark all the defs in 6. as invalid, but only the subreg use as invalid. Either we must only invalidate the individual subreg for both uses and defs, or the super register for both.

Differential Revision: https://reviews.llvm.org//D157564

Change-Id: I99d5e0b1a0d735e8ea3bd7d137b6464690aa9486
2023-08-11 09:01:18 -07:00
Jeffrey Byrnes
d0e54e377b [AMDGPU] Extend CalculateByteProvider to capture vectors and signed
Differential Revision: https://reviews.llvm.org/D157133

Change-Id: I9ba8727b4ac5a627de2f7d87d2169eb79e01f0ee
2023-08-11 08:47:17 -07:00
Joe Nash
2fb4bfa5ba [AMDGPU][True16] Fix ISel for A16 Image Instructions
The 16-bit VAddr arguments to A16 image instructions are packed into
legal VGPR_32 operands in AMDGPULegalizerInfo::legalizeImageIntrinsic on
all subtargets. With True16, we also need to pack if the number of VAddr is one
because VGPR_16 is not a legal argument to those Image instructions.

No change to emitted code intended on subtargets pre-GFX11, and none on GFX11
until True16 is active.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D157426
2023-08-11 11:12:16 -04:00
Matt Arsenault
8f18cf77e7 AMDGPU: Check for implicit defs before constant folding instruction
Can't delete the constant folded instruction if scc is used.

Fixes #63986

https://reviews.llvm.org/D157504
2023-08-11 10:29:53 -04:00
Matt Arsenault
1030483561 AMDGPU/GlobalISel: Handle stacksave/stackrestore
https://reviews.llvm.org/D156670
2023-08-11 10:25:01 -04:00
Matt Arsenault
9a53f5f5c4 AMDGPU: Handle llvm.stacksave and llvm.stackrestore
Not sure if the only valid use is to have stackrestore directly
consume stacksave outputs or not. Handled exactly like a regular stack
pointer so all the edge cases theoretically should work.

https://reviews.llvm.org/D156669
2023-08-11 10:25:01 -04:00
Joe Nash
ef79d9e38e [AMDGPU][NFC] Regenerate CHECKs as pre-commit for D157426 2023-08-11 09:55:59 -04:00
Matt Arsenault
29fff3e2ab AMDGPU: Try to select fmul by power of 2 to ldexp
For the f64 case, this gives us a cheaper to materialize 32-bit
constant. It's less obviously a win for f32 and f16. It forces us to
use a VOP3 encoding so it's a neutral code size change.

GlobalISel cases don't work because of the constant-is-copy-to-vgpr
problem.

https://reviews.llvm.org/D157111
2023-08-11 07:57:55 -04:00
Matt Arsenault
c8a4f2a8c1 AMDGPU: Add baseline tests for fmul-to-ldexp patterns
We can better some multiply-by-power-of-2 patterns as ldexp.
2023-08-11 07:57:55 -04:00
Stanislav Mekhanoshin
02046ad944 [AMDGPU] W/a for gfx940 byte0 fp8 conversion bug
VOP1 form of these do not work.

Differential Revision: https://reviews.llvm.org/D157683
2023-08-11 02:21:21 -07:00
pvanhout
490a867f16 [GlobalISel] Also set dead flags of implicit defs added by BuildMI
BuildMI automatically adds the implicit operands of the
instruction. This meant we couldn''t set the dead flag on
dead implicit defs in that case.

Fix it by introducing an opcode to mark a given implicit
def as dead.

Fixes #64565

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D157515
2023-08-11 08:38:37 +02:00
pvanhout
89e91e4c0c [AMDGPU] Remove post-PromoteAlloca SROA run
PromoteAlloca now uses SSAUpdater, it doesn't need SROA to clean-up after it anymore.

Internal testing shows no noticeable performance impact.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D156398
2023-08-11 08:29:21 +02:00
Matt Arsenault
7575ee7167 AMDGPU: Add more test coverage for FP-typed atomicrmw xchg 2023-08-10 17:38:25 -04:00
Changpeng Fang
1e22873ef4 [AMDGPU][NFC] Rename two LIT test files 2023-08-10 11:31:14 -07:00
Jay Foad
3091bdb86d [AMDGPU] Do not release VGPRs at -O0
This was an oversight when the GFX11 early release VGPRs optimization
was reimplemented in D153279.

Sending the DEALLOC_VGPRS message is a performance optimization so there
is no need to do it at -O0. In addition it makes some kinds of post
mortem debugging hard or impossible, since VGPR values are no longer
available to inspect at the s_endpgm instruction.

Differential Revision: https://reviews.llvm.org/D157599
2023-08-10 14:58:06 +01:00
Matt Arsenault
6dbd458128 AMDGPU: Remove pointless libcall optimization of fma/mad
After the library is linked and trivially inlined, the generic fma and
fmuladd intrinsics already handle these cases, and with precise flag
handling. This was requiring all fast math flags when we really just
need nsz for the fma(a, b, 0) case.

https://reviews.llvm.org/D156677
2023-08-09 19:37:52 -04:00
Matt Arsenault
6448d5ba58 AMDGPU: Remove pointless libcall recognition of native_{divide|recip}
This was trying to constant fold these calls, and also turn some of
them into a regular fmul/fdiv. There's no point to doing that, the
underlying library implementation should be using those in the first
place. Even when the library does use the rcp intrinsics, the backend
handles constant folding of those. This was also only performing the
folds under overly strict fast-evertyhing-is-required conditions.

The one possible plus this gained over linking in the library is if
you were using all fast math flags, it would propagate them to the new
instructions. We could address this in the library by adding more fast
math flags to the native implementations.

The constant fold case also had no test coverage.

https://reviews.llvm.org/D156676
2023-08-09 18:48:46 -04:00
Matt Arsenault
58e87c961e AMDGPU: Port AMDGPULowerKernelArguments to new pass manager
https://reviews.llvm.org/D157498
2023-08-09 18:34:30 -04:00