SIInsertWaitcnts inserts waitcnt instructions to resolve data
dependencies. The GFX10+ vscnt (VMEM store count) counter is never used
in this way. It is only used to resolve memory dependencies, and that is
handled by SIMemoryLegalizer. Hence there is no need to conservatively
wait for vscnt to be 0 on function entry and before returns.
Differential Revision: https://reviews.llvm.org/D153537
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.
getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.
At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).
This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.
Differential Revision: https://reviews.llvm.org/D139468
This patch allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits in cases where the ISD::SRL source operand has other uses, enabling us to peek through the shifted value if we don't demand all the bits/elts.
This is another step towards removing SelectionDAG::GetDemandedBits and just using TargetLowering::SimplifyMultipleUseDemandedBits.
There a few cases where we end up with extra register moves which I think we can accept in exchange for the increased ILP.
Differential Revision: https://reviews.llvm.org/D77804
Add GFX11 test coverage to a bunch of tests where it was easy to do so,
mostly because the checks are autogenerated and/or GFX11 can share the
same checks as GFX10.
Differential Revision: https://reviews.llvm.org/D129295
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Code using indirect calls is broken without this, and there isn't
really much value in supporting the old attempt to vary the argument
placement based on uses. This resulted in more argument shuffling code
anyway.
Also have the option stop implying all inputs need to be passed. This
will no rely on the amdgpu-no-* attributes to avoid passing
unnecessary values.
Change VOP_PAT_GEN to default to not generating an instruction selection
pattern for the VOP2 (e32) form of an instruction, only for the VOP3
(e64) form. This allows SIFoldOperands maximum freedom to fold copies
into the operands of an instruction, before SIShrinkInstructions tries
to shrink it back to the smaller encoding.
This affects the following VOP2 instructions:
v_min_i32
v_max_i32
v_min_u32
v_max_u32
v_and_b32
v_or_b32
v_xor_b32
v_lshr_b32
v_ashr_i32
v_lshl_b32
A further cleanup could simplify or remove VOP_PAT_GEN, since its
optional second argument is never used.
Differential Revision: https://reviews.llvm.org/D114252
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
This changes the lowering of saddsat and ssubsat so that instead of
using:
r,o = saddo x, y
c = setcc r < 0
s = c ? INTMAX : INTMIN
ret o ? s : r
into using asr and xor to materialize the INTMAX/INTMIN constants:
r,o = saddo x, y
s = ashr r, BW-1
x = xor s, INTMIN
ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD
This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.
Differential Revision: https://reviews.llvm.org/D105853
the compilation time and there is no case for which we see any improvement in
performance. This patch removes this pass and its associated test cases from
the tree.
Differential Revision: https://reviews.llvm.org/D101313
Change-Id: I0599169a7609c19a887f8d847a71e664030cc141