26 Commits

Author SHA1 Message Date
Matt Arsenault
b5bc205d75 AMDGPU: Convert some bit operation tests to opaque pointers 2022-11-29 18:36:53 -05:00
Alexander Timofeev
fbdea5a2e9 [AMDGPU] Always select s_cselect_b32 for uniform 'select' SDNode
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler.  It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D133593
2022-09-15 22:03:56 +02:00
Simon Pilgrim
69d5a038b9 [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits
This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.

This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits.

There a few cases where we end up with extra register moves which I think we can accept in exchange for the increased ILP.

Differential Revision: https://reviews.llvm.org/D77804
2022-07-28 14:10:44 +01:00
Jay Foad
3eb2281bc0 [AMDGPU] Aggressively fold immediates in SIFoldOperands
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
  which might increase occupancy. (On the other hand, many of these
  registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
  into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.

Differential Revision: https://reviews.llvm.org/D114643
2022-05-18 10:19:35 +01:00
alex-t
d159b4444c [AMDGPU] Enable divergence predicates for negative inline constant subtraction
We have a pattern that undo sub x, c -> add x, -c canonicalization since c is more likely
 an inline immediate than -c. This patch enables it to select scalar or vector subtracion by the input node divergence.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D121360
2022-03-10 15:03:22 +03:00
Benjamin Kramer
0776f6e04d [LSV] Vectorize loads of vectors by turning it into a larger vector
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.

Differential Revision: https://reviews.llvm.org/D117219
2022-01-26 11:38:41 +01:00
Austin Kerbow
da067ed569 [AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777
2021-12-01 22:31:28 -08:00
Joe Nash
3ce1b9631a [AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
2021-09-14 15:11:27 -04:00
alex-t
ed0f4415f0 [AMDGPU] Divergence-driven compare operations instruction selection
Description: This change enables the compare operations to be selected to SALU/VALU form
             dependent of the SDNode divergence flag.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D106079
2021-08-25 18:30:49 +03:00
Stanislav Mekhanoshin
4a3b055653 [AMDGPU] Fix flags of V_MOV_B64_PSEUDO
In particular it was not rematerializable.

Differential Revision: https://reviews.llvm.org/D105724
2021-07-09 12:49:28 -07:00
Stanislav Mekhanoshin
381ded345b [AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants
This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.

This gives 10-20% uplift in a set of huge apps heavily using double
precession math.

Fixes: SWDEV-292645

Differential Revision: https://reviews.llvm.org/D104874
2021-06-30 11:45:38 -07:00
hsmahesha
4905536086 Revert "[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size"
This reverts commit cc9d69385659be32178506a38b4f2e112ed01ad4.
2020-07-17 12:20:37 +05:30
Matt Arsenault
778351df77 Revert "[AMDGPU] Enable compare operations to be selected by divergence"
This reverts commit 521ac0b5cea02f629d035f807460affbb65ae7ad.

Reported to break thousands of piglit tests.
2020-06-24 11:21:30 -04:00
alex-t
521ac0b5ce [AMDGPU] Enable compare operations to be selected by divergence
Summary: Details: This patch enables SETCC to be selected to S_CMP_* if uniform and V_CMP_* if divergent.

Reviewers: rampitec, arsenm

Reviewed By: rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82194
2020-06-24 11:50:40 +03:00
Your Name
cc9d693856 [AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size
Summary:
Make use of both the - (1) clustered bytes and (2) cluster length, to decide on
the max number of mem ops that can be clustered. On an average, when loads
are dword or smaller, consider `5` as max threshold, otherwise `4`. This
heuristic is purely based on different experimentation conducted, and there is
no analytical logic here.

Reviewers: foad, rampitec, arsenm, vpykhtin

Reviewed By: rampitec

Subscribers: llvm-commits, kerbowa, hiraditya, t-tye, Anastasia, tpr, dstuttard, yaxunl, nhaehnle, wdng, jvesely, kzhuravl, thakis

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D82393
2020-06-24 00:39:41 +05:30
hsmahesha
7410571ce9 Revert "[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size"
This reverts commit 40a632a335119fe3e8d5d500a5d2641998314ecb.
2020-06-09 19:27:17 +05:30
hsmahesha
40a632a335 [AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size
Summary:
Make use of both the - (1) clustered bytes and (2) cluster length, to decide on
the max number of mem ops that can be clustered. On an average, when loads
are dword or smaller, consider `5` as max threshold, otherwise `4`. This heuristic
is purely based on different experimentation conducted, and there is no analytical
logic here.

Reviewers: foad, rampitec, arsenm, vpykhtin

Reviewed By: foad, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, Anastasia, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D81085
2020-06-09 14:09:14 +05:30
Stanislav Mekhanoshin
71ed66d97f [AMDGPU] Make v4i64/v4f64/v8i64/v8f64 legal
We can produce such vectors in the Promote Alloca pass,
but we are unable to use movrel to operate it and lower
via scratch. Making it legal makes SI_INDIRECT patterns
work.

There is more work to do in subsequent changes:

1. We initialize m0 twice to access each dword. It shall
be possible to only do it once and increment base register
number instead.
2. We also need v16i64/v16f64 but these first need to be
added to tablegen.

Differential Revision: https://reviews.llvm.org/D79808
2020-05-12 16:05:12 -07:00
Simon Pilgrim
e91feeed21 [AMDGPU] Add ISD::FSHR -> ALIGNBIT support
This patch allows ISD::FSHR(i32) patterns to lower to ALIGNBIT instructions.

This improves test coverage of ISD::FSHR matching - x86 has both FSHL/FSHR instructions and we prefer FSHL by default.

Differential Revision: https://reviews.llvm.org/D76070
2020-03-12 20:16:57 +00:00
Jay Foad
b777e551f0 [MachineScheduler] Reduce reordering due to mem op clustering
Summary:
Mem op clustering adds a weak edge in the DAG between two loads or
stores that should be clustered, but the direction of this edge is
pretty arbitrary (it depends on the sort order of MemOpInfo, which
represents the operands of a load or store). This often means that two
loads or stores will get reordered even if they would naturally have
been scheduled together anyway, which leads to test case churn and goes
against the scheduler's "do no harm" philosophy.

The fix makes sure that the direction of the edge always matches the
original code order of the instructions.

Reviewers: atrick, MatzeB, arsenm, rampitec, t.p.northover

Subscribers: jvesely, wdng, nhaehnle, kristof.beyls, hiraditya, javed.absar, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D72706
2020-01-14 19:19:02 +00:00
Matt Arsenault
1022c0dfde AMDGPU: Decompose all values to 32-bit pieces for calling conventions
This is the more natural lowering, and presents more opportunities to
reduce 64-bit ops to 32-bit.

This should also help avoid issues graphics shaders have had with
64-bit values, and simplify argument lowering in globalisel.

llvm-svn: 366578
2019-07-19 13:57:44 +00:00
Matt Arsenault
06eed42213 AMDGPU: Use getTargetConstant
Avoids creating an extra intermediate mov.

llvm-svn: 366340
2019-07-17 15:35:36 +00:00
Matt Arsenault
71dfb7ec5c AMDGPU: Make s34 the FP register
Make the FP register callee saved.

This is tricky because now the FP needs to be spilled in the prolog
relative to the incoming SP register, rather than the frame register
used throughout the rest of the function. I don't like how this
bypassess the standard mechanism for CSR spills just to get the
correct insert point. I may look for a better solution, since all CSR
VGPRs may also need to have all lanes activated. Another option might
be to make getFrameIndexReference change the base register if the
frame index is a CSR, and then try to figure out the right insertion
point in emitProlog.

If there is a free VGPR lane available for SGPR spilling, try to use
it for the FP. If that would require intrtoducing a new VGPR spill,
try to use a free call clobbered SGPR. Only fallback to introducing a
new VGPR spill as a last resort.

This also doesn't attempt to handle SGPR spilling with scalar stores.

llvm-svn: 365372
2019-07-08 19:03:38 +00:00
Matt Arsenault
2065680b47 AMDGPU: Don't use the default cpu in a few tests
Avoids unnecessary test changes in a future commit.

llvm-svn: 357539
2019-04-03 00:00:58 +00:00
Alexander Timofeev
36617f0160 [AMDGPU] Divergence driven instruction selection. Part 1.
Summary: This change is the first part of the AMDGPU target description
    change. The aim of it is the effective splitting the vector and scalar
    flows at the selection stage. Selection uses predicate functions based
    on the framework implemented earlier - https://reviews.llvm.org/D35267

    Differential revision: https://reviews.llvm.org/D52019

    Reviewers: rampitec

llvm-svn: 342719
2018-09-21 10:31:22 +00:00
Matt Arsenault
e719139b10 AMDGPU: Fix shifts for i128
llvm-svn: 339270
2018-08-08 16:58:33 +00:00