Switch DAGISel over to UniformityAnalysis, which was one of the last remaining users of the DivergenceAnalysis.
No explosions seen during internal testing so this looks like a smooth transition.
Reviewed By: sameerds
Differential Revision: https://reviews.llvm.org/D145918
Switch DAGISel over to UniformityAnalysis, which was one of the last remaining users of the DivergenceAnalysis.
No explosions seen during internal testing so this looks like a smooth transition.
Reviewed By: sameerds
Differential Revision: https://reviews.llvm.org/D145918
Post ISel, LDS variables are absolute values. Representing them as
such is simpler than the frame recalculation currently used to build assembler
tables from their addresses.
This is a precursor to lowering dynamic/external LDS accesses from non-kernel
functions.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144221
The function makes liveness tests for the entire live register set for every instruction it passes by.
This becomes very slow on high RP regions such as ASAN enabled code.
Instead only uses of last tracked instruction should be tested and this greatly improves compilation time.
This patch revealed few bugs in SIFormMemoryClauses and PreRARematStage::sinkTriviallyRematInsts which should
be fixed first.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D136267
Values in SGPR and VGPR register are treated as unsigned by hardware.
When value in 32-bit SGPR or VGPR base can be negative calculate offset
using 32-bit add instructions, otherwise use
sgpr(unsigned) + vgpr(unsigned) + offset.
LoopStrengthReduce.cpp changes offsets to negative and in some
iterations value in SGPR or VGPR register could be negative.
Differential Revision: https://reviews.llvm.org/D144957
Values in VGPR register are treated as unsigned by hardware.
When value in 32-bit VGPR base can be negative calculate offset using
32-bit add instruction, otherwise use vgpr base(unsigned) + offset.
Does not affect case where whole offset comes from VGPR register
(immediate offset is 0).
LoopStrengthReduce.cpp changes offsets to negative and in some
iterations value in VGPR register could be negative.
Differential Revision: https://reviews.llvm.org/D144956
Values in SGPR register are treated as unsigned by hardware.
When value in 32-bit SGPR base can be negative calculate offset using
32-bit add instruction, otherwise use sgpr base(unsigned) + offset.
Does not affect case where whole offset comes from SGPR register
(immediate offset is 0).
LoopStrengthReduce.cpp changes offsets to negative and in some
iterations value in SGPR register could be negative.
Differential Revision: https://reviews.llvm.org/D144955
We reuse live registers after tracking one MBB as live-ins to the successor MBB
if the successor is only one but we don't check if the successor has other predecessors.
`A B`
` \ /`
` C`
A and B have one successor but C has live-ins defined by A and B and therefore should be
initialized using LIS.
This fixes 83 lit tests out if 420 with EXPENSIVE_CHECK enabled.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D136918
Memory models for gfx90a and gfx940 do not require buffer_wbl2
before the fence for acquire ordering, but we do insert the full
release.
Fixes: SWDEV-386785
Differential Revision: https://reviews.llvm.org/D145524
Currently, the codegen support for llvm.amdgcn.workgroup.id*
intrinsics are enabled only for compute kernels. In addition,
this patch enables their selection for compute shaders on
subtargets that have architected SGPRs.
Differential Revision: https://reviews.llvm.org/D145045
A little extra change was needed in UA because it didn't consider
InvokeInst and it made call-constexpr.ll assert.
Reviewed By: sameerds, arsenm
Differential Revision: https://reviews.llvm.org/D145358
Copy through the low bits and only apply an f32
copysign to the high half. This is effectively
what we do for codegen anyway, but this provides
some combine benefits. The cases involving constants
show some small improvements.
https://reviews.llvm.org/D142682
The fcopysign DAG operation, unlike the IR one, allows
different types for the sign and magnitude. We can reduce
the bitwidth of the high operand since only the sign bit matters.
The default combine only introduces mixed fcopysign
operand types from fpext/fptrunc. We effectively do this
already during selection, but doing it earlier in the combiner
should expose new combine opportunities (e.g. the existing tests
now eliminate the load of the low half of the double). Unfortunately
this isn't enough to handle the case I'm interested in just yet.
I initially attempted to select the source modifier from xor of
a sign mask. This proved to be more difficult since
foldBinOpIntoSelect does not consider free fneg of integers
and undoes the combine.
This just tweaks the fix for D145232 to make the limit more precise, so
that we could actually emit a delay of 3 SALU cycles (the maximum) if we
had any SALU instructions that required it.
Not sure why this was broken, but I was seeing this linker error:
ld64.lld: error: undefined symbol: (anonymous namespace)::AMDGPUInsertDelayAlu::DelayInfo::SALU_CYCLES_MAX
>>> referenced by AMDGPUInsertDelayAlu.cpp:129 (/Users/matt/src/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp:129)
Based on experimentation on gfx906,908,90a and 1030, wider global loads / stores are more performant than multiple narrower ones independent of alignment -- this is especially true when combining 8 bit loads / stores, in which case speedup was usually 2x across all alignments.
Differential Revision: https://reviews.llvm.org/D145170
Change-Id: I6ee6c76e6ace7fc373cc1b2aac3818fc1425a0c1
AMDGPU implements some handy code for expanding all constexpr
users of LDS globals. Extract the core logic into ReplaceConstant,
so that it can be reused elsewhere.
It is only used to infer the types of offset parameters in isel patterns,
which we can specify directly.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D144890
The matching for V_FMA_MIX was partially implemented with a C++
matcher (for fmas with 32 bit results and 16 bit inputs) and partially
in tablegen (for fmas with 16 bit results). Move the C++ matcher logic
into tablegen to make this more consistent and so we can remove the
duplication between SDAG and GISel.
Differential Revision: https://reviews.llvm.org/D144612
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.
Differential Revision: https://reviews.llvm.org/D144650
If more registers are needed for VAddr then the NSA format allows then the
final register can act as a contigous set of remaining addresses. Update
legalizer to pack register for this new format and allow instruction
selection to use NSA encoding when number of addresses exceeds max size.
Also update SIShrinkInstructions to handle partial NSA.
Differential Revision: https://reviews.llvm.org/D144034
Image sample instructions that need more than 5 VGPRs for VAddr can use
partial NSA for NSA encoding format. VGPRs that can not fit into the
encoding are sequential after the last one.
This patch adds assembly and disassembly parts.
Differential Revision: https://reviews.llvm.org/D144033
D143174 lifted the artificial type restriction by promoting
offset to i32. This patch handles more cases: those involving
immediate offset in MUBUF.
Differential Revision: https://reviews.llvm.org/D144628
Currently, raw_buffer_load_{i8,i16} and struct_buffer_load_{i8,i16}
intrinsics are lowered as buffer_load_{u8,u16}. This patch combines
buffer_load_{u8,u16} and sign extension instructions in order to
generate buffer_load_{i8,i16} instructions.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D144313
v_permlane16_b32 and v_permlanex16_b32 should not set abs and neg src
modifiers on any input, but they can set op_sel on src0 or src1 to
represent fi or bc when desired. The ISel patterns were setting
the src_modifier bits to -1, effectively setting abs and neg as well,
whenever it was intended to set op_sel, due to an error in ISel. ISel
should now correctly only set the op_sel bits.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144519
Moving this out of AMDGPUBaseInfo enforces that AMDGPUBaseInfo should
not be calling into GCNSubtarget.
Differential Revision: https://reviews.llvm.org/D144564
These checks show optimized instructions if an operand is known to be
(partially) zero.
Change-Id: Ie2f6d0d3ee9d5b279d1f4c1dd0787492e39cc77a
Differential Revision: https://reviews.llvm.org/D140208