If more registers are needed for VAddr than the NSA format allows, then the
final register can act as a contiguous set of the remaining addresses. Update
the legalizer to pack registers for this new format and allow instruction
selection to use NSA encoding when the number of addresses exceeds the max
size. Also update SIShrinkInstructions to handle partial NSA.
Differential Revision: https://reviews.llvm.org/D144034
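The partial NSA packing described in this commit message can be modelled roughly as follows. This is only a sketch of the idea, not the legalizer code: the function name `pack_vaddr_for_nsa`, the tuple representation of operands, and the register names are all hypothetical.

```python
def pack_vaddr_for_nsa(addr_regs, max_nsa_size):
    """Return the VAddr operand list after (partial) NSA packing.

    Each operand is a tuple of registers: a one-element tuple is a single
    register, a longer tuple stands for one contiguous register range.
    Assumes max_nsa_size >= 1.
    """
    if len(addr_regs) <= max_nsa_size:
        # Everything fits: one operand per address register.
        return [(reg,) for reg in addr_regs]
    # Keep the first max_nsa_size - 1 addresses as separate operands and
    # fold all remaining addresses into the final operand.
    head = [(reg,) for reg in addr_regs[: max_nsa_size - 1]]
    tail = tuple(addr_regs[max_nsa_size - 1:])
    return head + [tail]


# Seven addresses with a (hypothetical) limit of five operands.
packed = pack_vaddr_for_nsa(["v0", "v1", "v2", "v3", "v4", "v5", "v6"], 5)
```

In this model the operand count never exceeds the limit, and the last operand absorbs whatever does not fit.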
Currently, raw_buffer_load_{i8,i16} and struct_buffer_load_{i8,i16}
intrinsics are lowered as buffer_load_{u8,u16}. This patch combines
buffer_load_{u8,u16} and sign extension instructions in order to
generate buffer_load_{i8,i16} instructions.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D144313
v_permlane16_b32 and v_permlanex16_b32 should not set abs and neg src
modifiers on any input, but they can set op_sel on src0 or src1 to
represent fi or bc when desired. The ISel patterns were setting
the src_modifier bits to -1, effectively setting abs and neg as well,
whenever it was intended to set op_sel, due to an error in ISel. ISel
should now correctly only set the op_sel bits.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144519
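The src_modifier problem in the permlane commit above can be illustrated with a small bit-mask model. The bit positions used here are assumptions made for the example and are not taken from the patch; the point is only that -1 sets every bit, while the intended value sets op_sel alone.

```python
# Assumed bit layout of a source-modifier operand (illustration only).
NEG = 1 << 0
ABS = 1 << 1
OP_SEL_0 = 1 << 2


def src_mods_buggy(want_op_sel):
    # Mirrors the reported mistake: -1 has every bit set, so abs and neg
    # are requested together with op_sel.
    return -1 if want_op_sel else 0


def src_mods_fixed(want_op_sel):
    # Only the op_sel bit is set when fi or bc is requested.
    return OP_SEL_0 if want_op_sel else 0
```

Checking `value & (NEG | ABS)` on both results shows the difference: the buggy variant leaks the abs and neg bits, the fixed one does not.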
These checks show optimized instructions if an operand is known to be
(partially) zero.
Change-Id: Ie2f6d0d3ee9d5b279d1f4c1dd0787492e39cc77a
Differential Revision: https://reviews.llvm.org/D140208
D139469 "[AMDGPU] Enable OMod on more VOP3 instructions" caused an
assertion failure when trying to fold into src2 of V_FMAC_F16. It would
temporarily convert the instruction to V_FMA_F16_gfx9 and add an opsel
operand, but if the fold still failed then it would forget to remove the
opsel operand.
Differential Revision: https://reviews.llvm.org/D144558
These tests show inefficient behavior that will be optimized by a
later change.
By using Known Bits Analysis, we can avoid unnecessary multiplications
or additions with 0.
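As a rough illustration of how known-bits information can remove work, consider a 64-bit multiply split into 32-bit partial products. The sketch below is a hypothetical model, not the code from the change: `known_zero` masks mark bits proven to be 0, and a partial product is skipped whenever one of its input halves is entirely known zero.

```python
MASK32 = 0xFFFFFFFF


def half_is_zero(known_zero, half):
    """True if the given 32-bit half (0 = low, 1 = high) is known zero."""
    return ((known_zero >> (32 * half)) & MASK32) == MASK32


def mul64_partial_products(a_known_zero, b_known_zero):
    """List the (a_half, b_half) products still needed for a 64-bit result."""
    needed = []
    for a_half in (0, 1):
        for b_half in (0, 1):
            if a_half + b_half > 1:
                continue  # hi * hi only affects bits 64 and above
            if half_is_zero(a_known_zero, a_half):
                continue  # multiplying by a known-zero half yields 0
            if half_is_zero(b_known_zero, b_half):
                continue
            needed.append((a_half, b_half))
    return needed


HIGH_HALF = 0xFFFFFFFF00000000
```

With nothing known, three partial products (and the additions that combine them) are needed. If both high halves are known zero, only the low-by-low product remains.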
Adds a new pass that removes functions
if they use features that are not supported on the current GPU.
This change is aimed at preventing crashes when building code at O0 that
uses idioms such as `if (ISA_VERSION >= N) intrinsic_a(); else intrinsic_b();`
where ISA_VERSION is not constexpr, and intrinsic_a is not selectable
on older targets.
This is a pattern that's used all over the ROCm device libs. The main
motive behind this change is to allow code using ROCm device libs
to be built at O0.
Note: the feature checking logic is done ad-hoc in the pass. There is no other
pass that needs (or will need in the foreseeable future) to do similar
feature-checking logic so I did not see a need to generalize the feature
checking logic yet. It can (and should probably) be generalized later and
moved to a TargetInfo-like class or helper file.
Reviewed By: arsenm, Joe_Nash
Differential Revision: https://reviews.llvm.org/D139000
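The core idea of the function-removal pass above can be sketched as a feature-set check. This is a simplified model under stated assumptions: the function names, feature names, and the dictionary representation are hypothetical, and the real pass works on IR rather than on plain sets.

```python
def remove_incompatible_functions(functions, gpu_features):
    """Split functions into those the GPU can run and those to remove.

    `functions` maps a function name to the set of features it requires;
    `gpu_features` is the set of features the current GPU supports.
    """
    kept = {}
    removed = []
    for name, required in functions.items():
        if required <= gpu_features:  # subset test: all features available
            kept[name] = required
        else:
            removed.append(name)
    return kept, removed


# Hypothetical module: one function per branch of the ISA_VERSION idiom.
module = {
    "uses_intrinsic_a": {"feature-new"},
    "uses_intrinsic_b": {"feature-base"},
}
kept, removed = remove_incompatible_functions(module, {"feature-base"})
```

On the older target only `uses_intrinsic_b` survives, so instruction selection never sees the intrinsic it cannot select.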
We do match source modifiers for f32 typed selects already, but the
combiner code was never informed of this.
A long time ago the documentation lied and stated that source
modifiers don't work for v_cndmask_b32 when they in fact do. We had a
bunch of code operating under the assumption that they don't support
source modifiers, so we tried to move fnegs around to work around
this.
Gets a few small improvements here and there. The main hazard to watch
out for is infinite loops in the combiner since we try to move fnegs
up and down the DAG. For now, don't fold fneg directly into select.
The generic combiner does this for a restricted set of cases
when getNegatedExpression obviously shows an improvement for both
operands. It turns out to be trickier to avoid infinite loops in the
combiner in conjunction with pulling out source modifiers, so
leave this for a later commit.
Ignore the multiple use heuristics of the default
implementation, and report cost based on inline immediates. This
is mostly interesting for -0 vs. 0. Gets a few small improvements.
fneg_fadd_0_f16 is a small regression. We could probably avoid this
if we handled folding fneg into div_fixup.
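The remark about -0 vs. 0 can be made concrete with a small model of inline immediates. The set used below (small integers from -16 to 64 plus a short list of floating-point values, matched by bit pattern) is an assumption loosely modelled on AMDGPU inline constants; the helper names are hypothetical.

```python
import struct

# Assumed set of floating-point inline constants (illustration only).
INLINE_FLOATS = (0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0)


def f32_bits(value):
    """Return the IEEE-754 single-precision bit pattern of value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


INLINE_FLOAT_BITS = frozenset(f32_bits(f) for f in INLINE_FLOATS)


def is_inline_f32(value):
    """True if the 32-bit pattern of value is encodable as an inline constant."""
    bits = f32_bits(value)
    # Reinterpret the pattern as a signed 32-bit integer.
    signed = bits - (1 << 32) if bits & 0x80000000 else bits
    if -16 <= signed <= 64:
        return True
    return bits in INLINE_FLOAT_BITS
```

Under these assumptions `0.0` has the all-zero bit pattern and is free to encode, while `-0.0` has only the sign bit set and needs a literal, which is why negating a zero constant is not free.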
D139710 added a metric to increase the schedule's ILP while
staying within the same occupancy. Do not bother applying this
metric to a region which is known to have spilling: it may cause
spilling to reappear after the previous stage, and it will do no
good if we are already spilling anyway. It may also reduce compile
time a bit for such regions.
Fixes: SWDEV-377300
Differential Revision: https://reviews.llvm.org/D143934
This doesn't make sense as an option. fneg and fabs are bit
preserving by definition. If a target has some fneg or fabs
instructions that are not bit-preserving, it's incorrect to lower
fneg/fabs to use them.
Requiring a bitcast to exist was unhelpful. The most basic cases
are always going to be a CopyFromReg or load, so they would need
a new cast inserted. Don't require a bitcast if it's a free
operation. I don't think this logic makes particularly much sense
(it seems to be imparting special interpretation of bitcast), but
this needs to be in sync with foldSignChangeInBitcast.
We should also get rid of this hasBitPreservingFPLogic hook. fabs/fneg
are bit-preserving or incorrectly implemented, so this should just be a
regular legality check.
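The claim that fneg and fabs are bit-preserving can be checked on raw single-precision bit patterns: fneg flips only the sign bit and fabs clears only the sign bit, so every other bit (including a NaN payload) is left alone. The helper names below are hypothetical and exist only for this illustration.

```python
import struct

SIGN_BIT = 0x80000000


def to_bits(value):
    """IEEE-754 single-precision bit pattern of value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def fneg_bits(bits):
    # Negation toggles the sign bit and nothing else.
    return bits ^ SIGN_BIT


def fabs_bits(bits):
    # Absolute value clears the sign bit and nothing else.
    return bits & 0x7FFFFFFF
```

Because both operations are plain integer logic on the bit pattern, they can be implemented with xor/and on the integer view of the value.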
While working on D143731 I hit a case where a build_vector with 2 undef operands could be generated (with one undef hidden behind a bitcast).
That made `reduceBuildVecTruncToBitCast` crash because it seems to assume there is at least one good operand.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D143886
Some subtargets use architected SGPRs for workgroup
IDs instead of the regular SGPRs. This patch enables
the support for the same and is guarded under the
subtarget feature FeatureArchitectedSGPRs.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D143707
Summary:
This is part of the leftover work for https://reviews.llvm.org/D143138.
In this work, we pass the code object version as an argument to initialize the target ID
and use it for the target ID dump.
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D143293
RFC to add a way to ignore COPY instructions when pattern-matching MIR in GISel.
- Add a new "GISelFlags" class to TableGen. Both `Pattern` and `PatFrags` defs can use it to alter matching behaviour.
- Flags start at zero and are scoped: the setter returns a `SaveAndRestore` object so that when the current scope ends, the flags are restored to their previous values. This allows child patterns to modify the flags without affecting the parent pattern.
- Child patterns always reuse the parent's pattern, but they can override its values. For more examples, see `GlobalISelEmitterFlags.td` tests.
- [AMDGPU] Use the IgnoreCopies flag in BFI patterns, which are known to be bothered by cross-regbank copies.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D136234
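The scoped-flags behaviour described in the GISelFlags commit above (flags start at zero, and a setter restores the previous value when its scope ends) can be sketched with a context manager. This is an analogy to `SaveAndRestore`, not the TableGen emitter's code; `MatcherFlags`, `scoped_set`, and the flag values are hypothetical.

```python
from contextlib import contextmanager

IGNORE_COPIES = 1 << 0
HYPOTHETICAL_FLAG = 1 << 1  # placeholder second flag for the example


class MatcherFlags:
    def __init__(self):
        self.value = 0  # flags start at zero

    @contextmanager
    def scoped_set(self, bits):
        saved = self.value
        self.value |= bits  # a child inherits the parent's flags
        try:
            yield self
        finally:
            self.value = saved  # restored when the scope ends


trace = []
flags = MatcherFlags()
with flags.scoped_set(IGNORE_COPIES):          # parent pattern
    trace.append(flags.value)
    with flags.scoped_set(HYPOTHETICAL_FLAG):  # child pattern
        trace.append(flags.value)
    trace.append(flags.value)                  # child change is gone
trace.append(flags.value)                      # back to zero
```

The recorded trace shows that the child sees the parent's flag plus its own, and that leaving each scope undoes only that scope's change.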
RegBankSelect can insert G_UNMERGE_VALUES in a lot of places which
left us with a lot of unmerge/merge pairs that could be simplified.
These often got in the way of pattern matching and made codegen
worse.
This patch:
- Makes the necessary changes to the merge/unmerge combines so they can run post RegBankSelect
- Adds relevant unmerge combines to the list of RegBankSelect combines for AMDGPU
- Updates some tablegen patterns that were missing explicit cross-regbank copies (V_BFI patterns were causing constant bus violations with this change).
This seems to be mostly beneficial for code quality.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D142192
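The kind of unmerge/merge pair mentioned in this commit message can be simplified as sketched below. This is a toy model on tuples of `(opcode, defs, uses)`, not the GlobalISel combiner; it assumes the merged pieces and the unmerged results have matching types and leaves the now-unused merge for dead-code elimination.

```python
def fold_unmerge_of_merge(instructions):
    """Rewrite G_UNMERGE_VALUES of a G_MERGE_VALUES result into copies."""
    merges = {}  # merged register -> tuple of source registers
    out = []
    for opcode, defs, uses in instructions:
        if opcode == "G_MERGE_VALUES":
            merges[defs[0]] = uses
        if (opcode == "G_UNMERGE_VALUES"
                and uses[0] in merges
                and len(merges[uses[0]]) == len(defs)):
            # Each unmerge result is just the corresponding merge input.
            for dst, src in zip(defs, merges[uses[0]]):
                out.append(("COPY", (dst,), (src,)))
            continue
        out.append((opcode, defs, uses))
    return out


before = [
    ("G_MERGE_VALUES", ("%m",), ("%a", "%b")),
    ("G_UNMERGE_VALUES", ("%x", "%y"), ("%m",)),
]
after = fold_unmerge_of_merge(before)
```

After the rewrite, later patterns see `%x` and `%y` as plain copies of `%a` and `%b` instead of having to look through the pair.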
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.
The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.
Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).
Differential Revision: https://reviews.llvm.org/D135462
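The alloca alignment change above can be summarized by comparing the old and new rules side by side. The two functions are a hypothetical model of the described behaviour, with `None` standing for "no alignment given in the IR".

```python
def alloca_alignment_old(explicit_align, pref_type_align):
    """Old rule: the preferred alignment effectively acted as a minimum."""
    if explicit_align is None:
        return pref_type_align
    return max(explicit_align, pref_type_align)


def alloca_alignment_new(explicit_align, pref_type_align):
    """New rule: always honour the specified alignment; fall back to the
    preferred type alignment only when none was specified."""
    if explicit_align is None:
        return pref_type_align
    return explicit_align
```

For an alloca deliberately aligned to 4 with a preferred alignment of 8, the old rule yields 8 and the new rule yields 4. When no alignment is given, both yield 8.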
Before GFX10, the A16 modifier implied G16. From GFX10 onwards there are
separate instructions for 16-bit gradients. This fixes the condition for
selecting G16 opcodes. Also stop adding the G16 flag to instructions that do
not use gradients from GFX10 onwards.
If a condition register def happens past the newly created use,
we do not properly update LIS. There are two problems:
1) We do not extend the defining segment to the end of its block,
marking it as live-out (this is a regression after
https://reviews.llvm.org/rG09d38dd7704a52e8ad2d5f8f39aaeccf107f4c56).
2) We do not extend the use segment to the beginning of the use block,
marking it as live-in.
Fixes: SWDEV-379563
Differential Revision: https://reviews.llvm.org/D143302
We were assuming we could rely on the flat scratch init detection
to tell whether there are possible flat-addressed stack objects, which
doesn't work outside of a kernel. We should have a way to prove
that a given flat access can't access the stack.
We could use a not-stack parameter attribute to avoid
these splits.
Make the minimally correct change for GlobalISel; I'll address
this better in my larger patch to rewrite load and store legalization.
Fixes: SWDEV-218237
Summary:
This patch introduces a mechanism to check the code object version from the module flag. This avoids checking from the command line.
If the module flag is missing, we use the current default code object version supported by the compiler.
For tools whose inputs are not IR, we may need another approach (a directive, for example) to check the code
object version. That will be in a separate patch later.
For the LIT test updates, we directly add the module flag if there is only a single code object version associated with all checks in one file.
In case of multiple code object versions in one file, we use the "sed" method to "clone" the checks to achieve the goal.
Reviewer: arsenm
Differential Revision: https://reviews.llvm.org/D14313
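The lookup order described in this commit message (module flag first, compiler default otherwise) can be sketched as follows. The flag name, the default value, and the dictionary standing in for module flags are all placeholders chosen for the example and are not taken from the patch.

```python
# Hypothetical placeholders, not values from the patch.
CODE_OBJECT_VERSION_FLAG = "code_object_version"
DEFAULT_CODE_OBJECT_VERSION = 4


def get_code_object_version(module_flags):
    """Return the version recorded in the module, or the default if absent."""
    version = module_flags.get(CODE_OBJECT_VERSION_FLAG)
    if version is None:
        # The module flag is missing: use the compiler's current default.
        return DEFAULT_CODE_OBJECT_VERSION
    return version
```

In this model a module that carries the flag always wins over the default, so the result no longer depends on a command-line option.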