73 Commits

Author SHA1 Message Date
choikwa
e5638c5a00
[AMDGPU] Use correct number of bits needed for div/rem shrinking (#80622)
There was an error where dividend of type i64 and actual used number of
bits of 32 fell into path that assumes only 24 bits being used. Check
that AtLeast field is used correctly when using computeNumSignBits and
add necessary extend/trunc for 32 bits path.

Regolden and update testcases.

@jrbyrnes @bcahoon @arsenm @rampitec
2024-02-06 21:32:28 +05:30
Fangrui Song
9e9907f1cf
[AMDGPU,test] Change llc -march= to -mtriple= (#75982)
Similar to 806761a7629df268c8aed49657aeccffa6bca449.

For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.

Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.

This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:

```
  LLVM :: CodeGen/AMDGPU/fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fabs.ll
  LLVM :: CodeGen/AMDGPU/floor.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
  LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
  LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
2024-01-16 21:54:58 -08:00
Acim Maravic
48f36c6e74
[LLVM] Make use of s_flbit_i32_b64 and s_ff1_i32_b64 (#75158)
Update DAG ISel to support 64bit versions S_FF1_I32_B64 and
S_FLBIT_I32_B664

---------

Co-authored-by: Acim Maravic <Acim.Maravic@amd.com>
2023-12-25 11:55:20 +01:00
Valery Pykhtin
dd051295bc
[AMDGPU] Enable GCNRewritePartialRegUses pass by default. (#72975)
Let's try once again after #69957 has landed.
2023-12-14 14:10:27 +01:00
Rin
befa925aca
[MachineLICM][AArch64] Hoist COPY instructions with other uses in the loop (#71403)
When there is a COPY instruction in the loop with other uses, we want to
hoist the COPY, which in turn leads to the users being hoisted as well.

Co-authored-by David Green : David.Green@arm.com
2023-11-20 10:01:04 +00:00
Stanislav Mekhanoshin
fe8335babb
[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395)
This allows folding of 64-bit operands if fit into 32-bit. Fixes
https://github.com/llvm/llvm-project/issues/67781
2023-10-30 08:12:28 -07:00
Pierre van Houtryve
40a426fac6
[AMDGPU] Constant fold FMAD_FTZ (#69443)
Solves #68315
2023-10-19 16:05:51 +02:00
Jay Foad
7b3bbd83c0 Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)"
This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c.

Reverted due to various buildbot failures.
2023-10-09 12:31:32 +01:00
Jay Foad
2501ae58e3
[CodeGen] Really renumber slot indexes before register allocation (#67038)
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
2023-10-09 11:44:41 +01:00
Ivan Kosarev
f04aa1f814
[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. (#68002) 2023-10-05 14:22:29 +03:00
Matt Arsenault
5b657f50b8 AMDGPU: Move LICM after AMDGPUCodeGenPrepare
The commit that added the run says it's to hoist uniform parts of
integer division expansion. That expansion is performed later, so this
didn't do anything in that case. Move this later so the original test
shows the improvement.

This also saves a run of "Canonicalize natural loops". Not sure why
this appears to be still getting a separate loop PM run. Also feels a
bit heavy to run this just for divide. Is there a way to specifically
hoist the divide sequence when it expands?
2023-06-10 07:37:32 -04:00
Valery Pykhtin
342acfc9bb [AMDGPU] Turn off pass to rewrite partially used virtual superregisters after RenameIndependentSubregs pass with registers of minimal size.
There is a failure with this pass in the case when target register class for a subregister isn't known from instruction description (for ex. COPY).
Currently in this situation the RC is obtained using TargetRegisterInfo::getSubRegisterClass but in general it's not working.

In order to fix this two things should be done:
1. Stop processing a subregister if the target register class is unknown (conservative approach)
2. Improve deduction of subregister' target register class (i.e by processing COPY chain)

I was going to implement point 1 but my tests use implicit operands for S_NOP and they don't have associated target register class and all tests fail.
Therefore I decided to turn off the pass now, implement point 1 and fix my tests.

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D152291
2023-06-07 12:05:25 +02:00
Valery Pykhtin
8d0412ce9d [AMDGPU] Add pass to rewrite partially used virtual superregisters after RenameIndependentSubregs pass with registers of minimal size.
The main purpose of this is to simplify register pressure tracking as after the pass there is no need
to track subreg liveness anymore.

On the other hand this pass creates more possibilites for the subreg unaware code, as many of the subregs
becomes ordinary registers.

Intersting sideeffect: spill-vgpr.ll has lost a lot of spills.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D139732
2023-05-26 09:05:44 +02:00
skc7
b434051dc8 [AMDGPU] Introduce SIInstrWorklist to process instructions in moveToVALU
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D147168
2023-04-10 11:34:14 +05:30
Nikita Popov
bdf2fbba9c [AMDGPU] Convert some tests to opaque pointers (NFC) 2022-12-19 12:41:13 +01:00
Carl Ritson
c316332e17 [Sink] Allow sinking of invariant loads across critical edges
Invariant loads can always be sunk.

Reviewed By: foad, arsenm

Differential Revision: https://reviews.llvm.org/D135133
2022-10-06 09:21:12 +09:00
Alexander Timofeev
fbdea5a2e9 [AMDGPU] Always select s_cselect_b32 for uniform 'select' SDNode
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler.  It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D133593
2022-09-15 22:03:56 +02:00
Craig Topper
ace05124f5 [IntegerDivision][AMDGPU] Use CreateLogicalOr to block poison propagation.
There are two ctlz intrinsics here with the zero_is_poison flag
set. There are also two comparisons that check if either of the
inputs the ctlzs are zero. We need to use a logical or to block
the poison from the ctlz if either of the inputs is zero.

Reviewed By: arsenm, aqjune

Differential Revision: https://reviews.llvm.org/D130680
2022-09-15 09:38:02 -07:00
Nuno Lopes
858fe8664e Expand Div/Rem: consider the case where the dividend is zero
So we can't use ctlz in poison-producing mode
2022-09-01 17:04:26 +01:00
Alexander Timofeev
a321d95b59 [AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs
In the 2e29b0138ca243 we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D130367
2022-08-02 18:37:57 +02:00
Carl Ritson
4c4db81630 [AMDGPU] Extend SILoadStoreOptimizer to s_load instructions
Apply merging to s_load as is done for s_buffer_load.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D130742
2022-07-30 11:38:39 +09:00
Alexander Timofeev
d7ae1a9097 Revert "[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs"
This reverts commit 76d9ae924cc361578ecbb5688559f7cebc78ab87.
because it causes several VK CTS tests to fail
2022-07-29 14:19:07 +02:00
Alexander Timofeev
76d9ae924c [AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs
In the 2e29b0138ca243 we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D130367
2022-07-28 14:30:29 +02:00
Simon Pilgrim
529bd4f352 [DAG] SimplifyDemandedBits - don't early-out for multiple use values
SimplifyDemandedBits currently early-outs for multi-use values beyond the root node (just returning the knownbits), which is missing a number of optimizations as there are plenty of cases where we can still simplify when initially demanding all elements/bits.

@lenary has confirmed that the test cases in aea-erratum-fix.ll need refactoring and the current increase codegen is not a major concern.

Differential Revision: https://reviews.llvm.org/D129765
2022-07-27 10:54:06 +01:00
Alexander Timofeev
2e29b0138c [AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.
Since the divergence-driven instruction selection has been enabled for AMDGPU,
 all the uniform instructions are expected to be selected to SALU form, except those not having one.
 VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm
 that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form
 and overhead introduced by the data transfer between the VGPRs and SGPRs.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128252
2022-07-14 23:59:02 +02:00
Jay Foad
e2926501d8 [AMDGPU] Aggressively fold immediates in SIShrinkInstructions
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Differential Revision: https://reviews.llvm.org/D114644
2022-05-18 11:04:33 +01:00
Jay Foad
3eb2281bc0 [AMDGPU] Aggressively fold immediates in SIFoldOperands
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
  which might increase occupancy. (On the other hand, many of these
  registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
  into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.

Differential Revision: https://reviews.llvm.org/D114643
2022-05-18 10:19:35 +01:00
Craig Topper
fa630e7594 [RISCV][AMDGPU][TargetLowering] Special case overflow expansion for (uaddo X, 1).
If we expand (uaddo X, 1) we previously expanded the overflow calculation
as (X + 1) <u X. This potentially increases the live range of X and
can prevent X+1 from reusing the register that previously held X.

Since we're adding 1, overflow only occurs if X was UINT_MAX in which
case (X+1) would be 0. So this patch adds a special case to expand
the overflow calculation to (X+1) == 0.

This seems to help with uaddo intrinsics that get introduced by
CodeGenPrepare after LSR. Alternatively, we could block the uaddo
transform in CodeGenPrepare for this case.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122933
2022-04-01 13:14:10 -07:00
Venkata Ramanaiah Nalamothu
04fff547e2 [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm, ronlieb

Differential Revision: https://reviews.llvm.org/D114652
2022-03-09 12:18:02 +05:30
Carl Ritson
565af157ef [AMDGPU] Extend pre-emit peephole to redundantly masked VCC
Extend pre-emit peephole for S_CBRANCH_VCC[N]Z to eliminate
redundant S_AND operations against EXEC for V_CMP results in VCC.
These occur after after register allocation when VCC has been
selected as the comparison destination.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D120202
2022-02-25 10:18:31 +09:00
Ron Lieberman
09b53296cf Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"
This reverts commit 9075009d1fd5f2bf9aa6c2f362d2993691a316b3.

 Failed amdgpu runtime buildbot # 3514
2021-12-22 11:39:28 -05:00
RamNalamothu
9075009d1f [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D114652
2021-12-22 20:51:12 +05:30
Austin Kerbow
da067ed569 [AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777
2021-12-01 22:31:28 -08:00
RamNalamothu
18f9351223 [AMDGPU] Do not generate ELF symbols for the local branch target labels
The compiler was generating symbols in the final code object for local
branch target labels. This bloats the code object, slows down the loader,
and is only used to simplify disassembly.

Use '--symbolize-operands' with llvm-objdump to improve readability of the
branch target operands in disassembly.

Fixes: SWDEV-312223

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D114273
2021-11-20 10:32:41 +05:30
Jay Foad
30b27ecfc2 [AMDGPU] Use new opcode for indexed vgpr reads
Introduce V_MOV_B32_indirect_read for indexed vgpr reads
(and rename the old V_MOV_B32_indirect to
V_MOV_B32_indirect_write) so they can be unambiguously
distinguished from regular V_MOV_B32_e32. Previously they
were distinguished by looking for extra implicit operands
but this is fragile because regular moves sometimes have
extra implicit operands too:
- either by accident, when instructions end up with
  duplicate implicit operands (see e.g. D100939)
- or by design, when SIInstrInfo::copyPhysReg breaks a
  multi-dword copy into individual subreg mov instructions
  and adds implicit operands for the super-register.

The effect of this is that SIInstrInfo::isFoldableCopy can
be simplified and identifies more foldable copies. The test
diffs show that more immediate 0 values have been folded as
inline operands.

SIInstrInfo::isReallyTriviallyReMaterializable could
probably be simplified too but that is not part of this
patch.

Differential Revision: https://reviews.llvm.org/D114230
2021-11-19 13:08:11 +00:00
Jay Foad
a70bbb5f7a [AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679
2021-11-12 15:48:41 +00:00
Stanislav Mekhanoshin
c80d8a8cea [AMDGPU] MachineLICM cannot hoist VALU
MachineLoop::isLoopInvariant() returns false for all VALU
because of the exec use. Check TII::isIgnorableUse() to
allow hoisting.

That unfortunately results in higher register consumption
since MachineLICM does not adequately estimate pressure.
Therefor I think it shall only be enabled after D107677 even
though it does not depend on it.

Differential Revision: https://reviews.llvm.org/D107859
2021-10-20 11:47:24 -07:00
Jay Foad
c885857e9d [AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
   allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
   pressure, which in some cases gives it more freedom to reorder
   instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
  increased by about 1%.
- The number of runs of two or more load instructions increased by
  about 4%.

Differential Revision: https://reviews.llvm.org/D111646
2021-10-13 17:12:26 +01:00
Jay Foad
66ce1015af Revert "[AMDGPU] Enable load clustering in the post-RA scheduler"
This reverts commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec.

It was committed by accident.
2021-10-12 16:19:35 +01:00
Jay Foad
66e13c7f43 [AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
   allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
   pressure, which in some cases gives it more freedom to reorder
   instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
  increased by about 1%.
- The number of runs of two or more load instructions increased by
  about 4%.
2021-10-12 16:09:04 +01:00
Joe Nash
3ce1b9631a [AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
2021-09-14 15:11:27 -04:00
alex-t
ed0f4415f0 [AMDGPU] Divergence-driven compare operations instruction selection
Description: This change enables the compare operations to be selected to SALU/VALU form
             dependent of the SDNode divergence flag.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D106079
2021-08-25 18:30:49 +03:00
Matt Arsenault
d719f1c3cc AMDGPU: Add alloc priority to global ranges
The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently added in
1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation
failure when many wide tuple spills are introduced. I think this is a
workaround since I would not expect the allocation priority to be
required, and only a performance hint. The allocator should be smarter
about when only a subregister needs to be spilled and restored.

This does regress a couple of degenerate store stress lit tests which
shouldn't be too important.
2021-08-10 13:12:34 -04:00
Jay Foad
e6c364a624 [AMDGPU][SDag] Better lowering for 64-bit ctlz/cttz
Differential Revision: https://reviews.llvm.org/D107546
2021-08-05 15:57:40 +01:00
Craig Topper
c23405174a [DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS.
This allows special constants like to 0 to be recognized. It's also
expected by isel patterns if a target had a mulh with immediate instructions.
The commuting done by tablegen won't commute patterns with immediates since it
expects DAGCombine to have done it.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D107486
2021-08-04 11:39:23 -07:00
Stanislav Mekhanoshin
4a3b055653 [AMDGPU] Fix flags of V_MOV_B64_PSEUDO
In particular it was not rematerializable.

Differential Revision: https://reviews.llvm.org/D105724
2021-07-09 12:49:28 -07:00
Stanislav Mekhanoshin
381ded345b [AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants
This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.

This gives 10-20% uplift in a set of huge apps heavily using double
precession math.

Fixes: SWDEV-292645

Differential Revision: https://reviews.llvm.org/D104874
2021-06-30 11:45:38 -07:00
Brendon Cahoon
3f7b7e7393 [AMDGPU] Update SCC defs to VCC when uses are changed to VCC
The FixSGPRCopies pass converts instructions to VALU when
removing illegal VGPR to SGPR copies. Instructions that use SCC
are changed to use VCC instead. When that happens, the pass must
also change instructions that define SCC to define VCC.

The pass was not changing the SCC definition when an ADDC is
converted due to a input that is a VGPR to SGPR copy. But, the
initial ADD insruction, which define SCC, is not converted.
This causes a compilation failure due to a use of an undefined
physical register.

This patch adds code that inserts the SCC definition in the
MoveToVALU worklist when a SCC use is converted to a VCC use.

Differential Revision: https://reviews.llvm.org/D102111
2021-05-14 18:05:05 -04:00
Jay Foad
ef443390a9 [AMDGPU] Remove MachineDCE after SIFoldOperands
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.

Reapply after fixing dependent change D100188.

Differential Revision: https://reviews.llvm.org/D100189
2021-04-19 12:08:02 +01:00
Mitch Phillips
092f288d36 Revert "[AMDGPU] Remove MachineDCE after SIFoldOperands"
This reverts commit 5a0117b2d0eaedffeeb393bd9915f11cccfe241b.

Reason: Dependent change d19a42eba98fe853dd52f7dc89d8cd2727c7fc1c broke
the ASan buildbots.
2021-04-09 15:47:44 -07:00