ASMPrinter was relying on feature bits to set up extra SGPRs in the kernel
descriptor for the xnack_mask. This was broken for the dynamic XNACK "any" TID
setting, which could cause user SGPRs to be clobbered if the number of SGPRs
reserved was near a granulated block boundary.
When XNACK was enabled, this worked correctly in the ASMParser, which meant some
kernels were only failing without "-save-temps".
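A rough sketch of why a count near a granulated block boundary is the risky case. The granule size of 8 and the 2 extra SGPRs for xnack_mask are illustrative assumptions, and `reserved_sgprs` is a hypothetical helper, not the AsmPrinter code:

```python
# Illustrative model only: granule size and extra-SGPR count are assumptions.
def granulated_blocks(num_sgprs, granule=8):
    """Number of allocation blocks needed to hold num_sgprs registers."""
    return max(1, -(-num_sgprs // granule))  # ceiling division


def reserved_sgprs(used_sgprs, extra_sgprs, granule=8):
    """SGPRs actually reserved once the count is rounded up to whole blocks."""
    return granulated_blocks(used_sgprs + extra_sgprs, granule) * granule


# Far from a boundary, forgetting the extra SGPRs changes nothing.
print(reserved_sgprs(9, 0), reserved_sgprs(9, 2))    # 16 16
# Near a boundary, the extra SGPRs spill into the next block.
print(reserved_sgprs(15, 0), reserved_sgprs(15, 2))  # 16 24
```

In this model a descriptor computed without the extra registers under-reserves only when the used count sits just below a block boundary, which matches the intermittent nature of the failure described above.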
Fixes: SWDEV-382764
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D145401
For D141247 - if that pattern was used by GISel it could cause constant bus limitation failures.
Just use inline immediates instead of S_MOV to avoid the issue.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D146131
Adds & uses a new `isDivergentUse` API in UA.
UniformityAnalysis now requires CycleInfo as well, since the new temporal divergence API can query it.
-----
Original patch that adds `isDivergentUse` by @sameerds
The user of a temporally divergent value is marked as divergent in the
uniformity analysis. But the same user may also have been marked divergent for
other reasons, thus losing the information about temporal divergence. Some
clients need to check specifically for temporal divergence. This change restores
such an API, which already existed in DivergenceAnalysis.
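As an illustration of temporal divergence, the following toy simulation (not LLVM code; `run_loop` and the per-thread trip counts are hypothetical) shows a loop counter that is uniform for every thread still inside the loop, yet differs between threads once they have left through a divergent exit:

```python
# Toy lockstep simulation: each thread leaves the loop after its own trip count.
def run_loop(trip_counts):
    active = set(range(len(trip_counts)))
    i = 0
    seen_inside = []       # per-iteration snapshot of i for the active threads
    seen_after_exit = {}   # value of i each thread carries out of the loop
    while active:
        seen_inside.append({t: i for t in active})
        i += 1
        for t in sorted(active):
            if i >= trip_counts[t]:   # divergent exit condition
                seen_after_exit[t] = i
                active.remove(t)
    return seen_inside, seen_after_exit


inside, after = run_loop([1, 3, 2])
# Inside the loop, all active threads agree on i in every iteration.
print(all(len(set(snapshot.values())) == 1 for snapshot in inside))  # True
# Outside the loop, the same value differs per thread.
print(after)  # {0: 1, 2: 2, 1: 3}
```

A use of `i` inside the loop is uniform, while a use after the loop is divergent. That per-use distinction is what an `isDivergentUse`-style query can express and a per-value query cannot.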
Reviewed By: sameerds, foad
Differential Revision: https://reviews.llvm.org/D146018
The backend knew about `v_sat_pk_u8_i16` but never made use of it.
This patch adds selection patterns (DAG/GISel) for that instruction.
I think it'll be very rarely used, but at least it's possible to use it.
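For reference, a small model of what the instruction is understood to compute: each signed 16-bit half of the source is clamped to the unsigned 8-bit range and the two results are packed together. This is a sketch based on the instruction name and on my reading of the ISA description, so treat the exact packing order as an assumption:

```python
def to_i16(bits):
    """Interpret the low 16 bits as a signed 16-bit integer."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits


def sat_u8(value):
    """Clamp a signed value to the unsigned 8-bit range [0, 255]."""
    return max(0, min(255, value))


def sat_pk_u8_i16(src):
    """Model: saturate both i16 halves of src to u8 and pack them into 16 bits."""
    lo = to_i16(src)
    hi = to_i16(src >> 16)
    return (sat_u8(hi) << 8) | sat_u8(lo)


print(hex(sat_pk_u8_i16(0x00050003)))  # 0x503 (both halves in range)
print(hex(sat_pk_u8_i16(0xFFFF012C)))  # 0xff  (hi = -1 -> 0, lo = 300 -> 255)
```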
Solves #58266 (https://github.com/llvm/llvm-project/issues/58266)
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D144729
Cover more cases in preparation for making greater use
of fcmp based lowerings. Also add more tests for the inverted
cases. Test iszero | isnan test masks. We should probably just
generate every combination of test masks.
Try to remove extra bitcasts around logicops if we're dealing with illegal types.
Fixes the regressions in D145939.
Differential Revision: https://reviews.llvm.org/D146032
Switch DAGISel over to UniformityAnalysis, which was one of the last remaining users of the DivergenceAnalysis.
No explosions seen during internal testing so this looks like a smooth transition.
Reviewed By: sameerds
Differential Revision: https://reviews.llvm.org/D145918
AMDGPU ISel can add extra passes when expensive checks are enabled. This means the pipeline can be reordered and the checks may fail.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D146038
[DAGCombiner] handle more store value forwarding
When lowering calls on targets like PPC, some stack loads
are generated for by-value parameters. The CALLSEQ_START node
prevents such loads from being combined.
As suggested by @RolandF, this patch removes the unnecessary
loads for the byval parameter by extending ForwardStoreValueToDirectLoad.
Reviewed By: nemanjai, RolandF
Differential Revision: https://reviews.llvm.org/D138899
Post ISel, LDS variables are absolute values. Representing them as
such is simpler than the frame recalculation currently used to build assembler
tables from their addresses.
This is a precursor to lowering dynamic/external LDS accesses from non-kernel
functions.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144221
Try to more aggressively narrow masks of extended values.
This is mainly for cases where the mask is trying to zero out any_extended upper bits, assuming we can zext/trunc the values for free.
This catches a few actual missed folds, as well as helps canonicalize a number of other cases which were being caught in isel etc.
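One instance of this kind of fold, sketched on plain integers: masking an any-extended i8 with a constant whose upper bits are zero gives the same result as masking in the narrow type and zero-extending. The helper names are hypothetical, and the exact conditions the combine checks are not shown here:

```python
def any_extend_i8(x, junk):
    """Model any_extend i8 -> i32: the upper 24 bits are unspecified (junk)."""
    return ((junk & 0xFFFFFF) << 8) | (x & 0xFF)


def wide_mask(x, junk, mask):
    """and (any_extend x), mask, performed in the wide type."""
    return any_extend_i8(x, junk) & mask


def narrow_mask(x, mask):
    """zext (and x, (trunc mask)): mask in the narrow type, then zero-extend."""
    return (x & 0xFF) & (mask & 0xFF)


# The two forms agree whenever the mask clears all of the extended upper bits.
ok = all(
    wide_mask(x, junk, mask) == narrow_mask(x, mask)
    for junk in (0, 0xABCDEF, 0xFFFFFF)
    for mask in (0xFF, 0x0F, 0x3C)
    for x in range(256)
)
print(ok)  # True
```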
Differential Revision: https://reviews.llvm.org/D145866
We don't do this transform in InstCombine in the general case for arbitrary values, because the cost of
an AND and 2 ICMPs isn't higher than that of a MIN and an ICMP. However, LICM also knows
about the loop structure. The transform becomes profitable if `A` and `B` are loop-invariant and
`X` is not: by doing this, we can compute the min outside the loop.
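A minimal sketch of the idea, assuming the pattern in question has the shape `X < A && X < B` (the predicate and the function names below are illustrative, not taken from the patch):

```python
def count_original(xs, a, b):
    """Two compares and an AND on every iteration."""
    count = 0
    for x in xs:
        if x < a and x < b:
            count += 1
    return count


def count_hoisted(xs, a, b):
    """min(a, b) is loop-invariant, so it is computed once outside the loop."""
    bound = min(a, b)
    count = 0
    for x in xs:
        if x < bound:  # a single compare remains in the loop body
            count += 1
    return count


values = [0, 3, 5, 7, 9, 12]
print(count_original(values, 8, 6), count_hoisted(values, 8, 6))  # 3 3
```

Without the loop there is nothing to hoist and the rewritten form still needs a MIN plus a compare, which is why the rewrite only pays off where loop structure is known.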
Differential Revision: https://reviews.llvm.org/D143726
Reviewed By: nikic
The function performs liveness tests on the entire live register set for every instruction it passes.
This becomes very slow in high register pressure regions, such as ASAN-enabled code.
Instead, only the uses of the last tracked instruction need to be tested, which greatly improves compilation time.
This patch revealed a few bugs in SIFormMemoryClauses and PreRARematStage::sinkTriviallyRematInsts, which should
be fixed first.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D136267
Values in SGPR and VGPR registers are treated as unsigned by hardware.
When the value in a 32-bit SGPR or VGPR base can be negative, calculate the
offset using 32-bit add instructions; otherwise use
sgpr(unsigned) + vgpr(unsigned) + offset.
LoopStrengthReduce.cpp changes offsets to negative, and in some
iterations the value in an SGPR or VGPR register could be negative.
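A simplified numeric model of the problem, assuming the hardware zero-extends the 32-bit base before adding the immediate offset (real address computation has more terms; the function names are hypothetical):

```python
MASK32 = 0xFFFFFFFF


def addr_unsigned_base(base32, offset):
    """Base register is read as an unsigned 32-bit value, then offset is added."""
    return (base32 & MASK32) + offset


def addr_add32(base32, offset):
    """Base and offset are first combined with a wrapping 32-bit add."""
    return (base32 + offset) & MASK32


# Non-negative base: both computations agree.
print(addr_unsigned_base(16, 8), addr_add32(16, 8))        # 24 24
# Negative base (-4): the unsigned reading overshoots the 32-bit range.
print(hex(addr_unsigned_base(-4, 8)), addr_add32(-4, 8))   # 0x100000004 4
```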
Differential Revision: https://reviews.llvm.org/D144957
Values in VGPR registers are treated as unsigned by hardware.
When the value in a 32-bit VGPR base can be negative, calculate the offset using
a 32-bit add instruction; otherwise use vgpr base(unsigned) + offset.
This does not affect the case where the whole offset comes from a VGPR register
(immediate offset is 0).
LoopStrengthReduce.cpp changes offsets to negative, and in some
iterations the value in a VGPR register could be negative.
Differential Revision: https://reviews.llvm.org/D144956
Values in SGPR registers are treated as unsigned by hardware.
When the value in a 32-bit SGPR base can be negative, calculate the offset using
a 32-bit add instruction; otherwise use sgpr base(unsigned) + offset.
This does not affect the case where the whole offset comes from an SGPR register
(immediate offset is 0).
LoopStrengthReduce.cpp changes offsets to negative, and in some
iterations the value in an SGPR register could be negative.
Differential Revision: https://reviews.llvm.org/D144955
Memory models for gfx90a and gfx940 do not require buffer_wbl2
before the fence for acquire ordering, but we do insert the full
release.
Fixes: SWDEV-386785
Differential Revision: https://reviews.llvm.org/D145524
Currently, codegen support for the llvm.amdgcn.workgroup.id*
intrinsics is enabled only for compute kernels. In addition,
this patch enables their selection for compute shaders on
subtargets that have architected SGPRs.
Differential Revision: https://reviews.llvm.org/D145045
A little extra change was needed in UA because it didn't consider
InvokeInst and it made call-constexpr.ll assert.
Reviewed By: sameerds, arsenm
Differential Revision: https://reviews.llvm.org/D145358
Copy through the low bits and only apply an f32
copysign to the high half. This is effectively
what we do for codegen anyway, but this provides
some combine benefits. The cases involving constants
show some small improvements.
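The split can be checked on raw bit patterns: the sign bit of an f64 lives in its high 32-bit half, at the same position as the sign bit of an f32. The sketch below models the f32 copysign as a bit operation on the high word (helper names are hypothetical):

```python
import math
import struct


def f64_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_f64(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def copysign_f32_bits(mag_bits, sign_bits):
    """copysign on 32-bit patterns: keep the magnitude bits, take the sign bit."""
    return (mag_bits & 0x7FFFFFFF) | (sign_bits & 0x80000000)


def copysign_f64_split(mag, sign):
    """Copy the low half through; apply a 32-bit copysign to the high half only."""
    m, s = f64_bits(mag), f64_bits(sign)
    lo = m & 0xFFFFFFFF
    hi = copysign_f32_bits(m >> 32, s >> 32)
    return bits_f64((hi << 32) | lo)


print(copysign_f64_split(1.5, -2.0))   # -1.5
print(copysign_f64_split(-1e300, 3.0)) # 1e+300
```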
https://reviews.llvm.org/D142682
The fcopysign DAG operation, unlike the IR one, allows
different types for the sign and magnitude. We can reduce
the bitwidth of the high operand since only the sign bit matters.
The default combine only introduces mixed fcopysign
operand types from fpext/fptrunc. We effectively do this
already during selection, but doing it earlier in the combiner
should expose new combine opportunities (e.g. the existing tests
now eliminate the load of the low half of the double). Unfortunately
this isn't enough to handle the case I'm interested in just yet.
I initially attempted to select the source modifier from xor of
a sign mask. This proved to be more difficult since
foldBinOpIntoSelect does not consider free fneg of integers
and undoes the combine.
Based on experimentation on gfx906, gfx908, gfx90a and gfx1030, wider global loads / stores are more performant than multiple narrower ones, independent of alignment. This is especially true when combining 8-bit loads / stores, in which case the speedup was usually 2x across all alignments.
Differential Revision: https://reviews.llvm.org/D145170
Change-Id: I6ee6c76e6ace7fc373cc1b2aac3818fc1425a0c1
Add tests for more complicated scratch load and store patterns.
Includes:
- sign and zero extending loads of i8 and i16 to i32 into 32-bit register
- D16 instructions that affect only high or low 16 bits of 32-bit register
- D16 sign and zero extending loads of i8 to i16 into high or low 16 bits
of 32-bit register
- D16 loads of i16 to high or low 16 bits of 32-bit register
- D16 stores of i8 and i16 from high 16 bits of 32-bit register
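The register effects these tests exercise can be modeled on plain integers. This is a simplified sketch with hypothetical helper names, intended only to show which bits of the 32-bit register each kind of load is expected to change:

```python
def sext8_to_32(byte):
    """Sign-extending load of an i8 into a full 32-bit register."""
    byte &= 0xFF
    return (byte - 0x100 if byte & 0x80 else byte) & 0xFFFFFFFF


def zext8_to_32(byte):
    """Zero-extending load of an i8 into a full 32-bit register."""
    return byte & 0xFF


def sext8_to_16(byte):
    """Sign-extend an i8 to 16 bits (input of a D16 sign-extending load)."""
    return sext8_to_32(byte) & 0xFFFF


def d16_lo(reg, value16):
    """D16 load into the low half: the high 16 bits of reg are preserved."""
    return (reg & 0xFFFF0000) | (value16 & 0xFFFF)


def d16_hi(reg, value16):
    """D16 load into the high half: the low 16 bits of reg are preserved."""
    return (reg & 0x0000FFFF) | ((value16 & 0xFFFF) << 16)


print(hex(sext8_to_32(0x80)), hex(zext8_to_32(0x80)))   # 0xffffff80 0x80
print(hex(d16_lo(0x12345678, 0xABCD)))                  # 0x1234abcd
print(hex(d16_hi(0x12345678, sext8_to_16(0xFF))))       # 0xffff5678
```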
Differential Revision: https://reviews.llvm.org/D145081