Removed definitions of vectorizeBasicBlock and VectorizeConfig
(possibly a remnant from the BBVectorize pass that was removed
way back in 2017).
Also reduced the number of include dependencies on Transforms/Vectorize.h.
In `lib/CodeGen/PrologEpilogInserter.cpp`, `RS` is assigned via `RS = TRI->requiresRegisterScavenging(MF) ? new RegScavenger() : nullptr;`, so `RS` can be `nullptr`. During the call `TFI->processFunctionBeforeFrameFinalized(MF, RS);`, `RS` can then be dereferenced by `RS->enterBasicBlock(MBB);` in `lib/Target/AMDGPU/SIFrameLowering.cpp`.
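The hazard can be illustrated with a minimal sketch. The types below are hypothetical stand-ins, not the real LLVM classes, and the null check shows only one possible way to guard the call; it is not necessarily what the patch does.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the real RegScavenger; not the LLVM class.
struct RegScavenger {
  int EnteredBlocks = 0;
  void enterBasicBlock(int /*MBB*/) { ++EnteredBlocks; }
};

// Mirrors the assignment pattern: a scavenger exists only when required.
inline std::unique_ptr<RegScavenger> makeScavenger(bool RequiresScavenging) {
  if (RequiresScavenging)
    return std::make_unique<RegScavenger>();
  return nullptr;
}

// A callee that tolerates a null scavenger. Returns true if RS was used.
inline bool processFunctionBeforeFrameFinalized(RegScavenger *RS, int MBB) {
  if (!RS)
    return false; // Without this guard, the next line would dereference null.
  RS->enterBasicBlock(MBB);
  return true;
}
```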
Reviewed By: skan, arsenm
Differential Revision: https://reviews.llvm.org/D146791
In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).
This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.
Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147786
PAL Metadata 3.0 introduces an explicit structure in metadata for the
programmable registers written out by the compiler backend.
This replaces opaque registers, which can change between different
architectures and require the backend to encode bitfield information
that may itself change between versions.
This is the initial minimal implementation that enables the use of PAL Metadata
3.0.
The change itself should be NFC for non-PAL, although the way the RSRC2
register is handled has been changed slightly.
The test is fairly minimal, but checks that the metadata format looks as
expected and verifies a couple of special cases such as tgid_[xyz]_en handling
and PsInputAddr/Ena which also change to explicit fields.
Differential Revision: https://reviews.llvm.org/D147143
The peephole optimizer tries to replace
```
%n:sgpr_32 = S_MOV_B32 x
$scc = COPY %n
```
with a `S_MOV_B32` directly into `$scc`.
This crashes because `S_MOV_B32` cannot take `$scc` as input.
We currently generate code like this from GlobalISel when lowering a
G_BRCOND with a constant condition. We should probably look into
removing this kind of branch altogether, but until then we should at
least not crash.
This patch fixes the issue by making sure we don't apply the peephole
optimization when trying to move into a physical register that
doesn't belong to the correct register class.
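As a rough illustration of that guard, here is a toy model in which a register class is just a list of register ids. `RegClass` and `canFoldMoveIntoPhysReg` are hypothetical names for this sketch and do not correspond to the real LLVM API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model: a register class is a plain list of register ids.
struct RegClass {
  std::vector<unsigned> Regs;
  bool contains(unsigned Reg) const {
    return std::find(Regs.begin(), Regs.end(), Reg) != Regs.end();
  }
};

// Only fold the move into the copy's physical destination when that
// register is a member of the class the move's destination requires.
inline bool canFoldMoveIntoPhysReg(const RegClass &MovDstRC, unsigned PhysReg) {
  return MovDstRC.contains(PhysReg);
}
```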
Differential Revision: https://reviews.llvm.org/D148117
With architected SGPRs, workgroup IDs are passed into a compute shader
in TTMP registers. Allow for this in AMDGPUResourceUsageAnalysis instead
of failing an assertion.
Differential Revision: https://reviews.llvm.org/D148239
Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).
This will be necessary to correctly recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).
Differential Revision: https://reviews.llvm.org/D148149
Apparently it was used to work around some issue that has been fixed.
Removing it helps with high scratch usage observed in some cases due to failed alloca promotion.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D145586
- As the address space cast may not be valid on a specific target,
`addrspacecast` is not handled when an `alloca` is able to be replaced
with the source of memcpy/memmove. This patch addresses that by
querying a target hook on whether that address space cast is valid.
For example, on most GPU targets, the cast from a global pointer to a
generic pointer is valid.
- If that cast is allowed (by querying `isValidAddrSpaceCast`), the
replacement is enhanced to handle that `addrspacecast` as well.
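
The check can be sketched as follows. The address space numbering and the body of `isValidAddrSpaceCast` are assumptions made up for this example (a target that only allows casts into the generic space); the real hook is target-specific, and `canRewriteCast` is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical address space numbering, for this sketch only.
enum AddrSpace : unsigned { Generic = 0, Global = 1, Local = 3 };

// Hypothetical target rule: a cast is valid if it is a no-op or if it
// goes into the generic address space.
inline bool isValidAddrSpaceCast(unsigned From, unsigned To) {
  return From == To || To == Generic;
}

// Once the alloca is replaced by the memcpy/memmove source in SrcAS,
// a user `addrspacecast(alloca) to CastDestAS` becomes
// `addrspacecast(source) to CastDestAS`, which must be a valid cast.
inline bool canRewriteCast(unsigned SrcAS, unsigned CastDestAS) {
  return isValidAddrSpaceCast(SrcAS, CastDestAS);
}
```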
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D147025
The math libraries have a lot of code that performs
manual sign bit operations by bitcasting doubles to int2
and doing bithacking on them. This is a bad canonical form;
we should rewrite it to use high-level sign operations directly
on double. To avoid codegen regressions, we need to do a better
job of moving fnegs to operate only on the high 32 bits.
This is only halfway to fixing the real case.
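For illustration, the kind of manual sign manipulation described above looks roughly like the following, next to the direct form. This is a sketch that assumes IEEE-754 binary64 doubles; the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Negate a double by flipping the sign bit in its high 32-bit word,
// the way hand-written math library code often does.
inline double negateViaHighWord(double X) {
  uint64_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  uint32_t Lo = static_cast<uint32_t>(Bits);
  uint32_t Hi = static_cast<uint32_t>(Bits >> 32);
  Hi ^= 0x80000000u; // The sign bit lives in the high half only.
  Bits = (static_cast<uint64_t>(Hi) << 32) | Lo;
  double Result;
  std::memcpy(&Result, &Bits, sizeof(Result));
  return Result;
}

// The high-level form that expresses the same operation directly.
inline double negateDirect(double X) { return -X; }
```

Both functions touch only the sign bit, which is why an fneg can be restricted to the high 32 bits.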
The GFX11 NGG Streamout Instructions perform atomic operations on
dedicated registers. At the moment, they lack machine memory operands,
which causes the si-memory-legalizer pass to treat them conservatively
and introduce several unnecessary waits and cache invalidations.
This patch introduces a new address space to represent these special
registers and teaches instruction selection to add memory operands with
this new address space to DS_ADD/SUB_GS_REG_RTN.
Since this address space is meant to be compiler-internal, we move it
up a bit from the other address spaces and give it the number 128.
According to the LLVM Language Reference, address space numbers can go
all the way up to 2^24, but I'm not sure how well this is supported in
practice [1], so using a smaller number seems safer.
[1] 0107513fe7/llvm/utils/TableGen/IntrinsicEmitter.cpp (L401)
Differential Revision: https://reviews.llvm.org/D146031
Use an `OptimizationRemark` for them even though it's not really an
optimization. It just integrates better with the other diagnostics
(enabling is easy with `-pass-remarks`).
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147703
Summary:
This is to avoid using the callee saved registers for the return address
of the tail call return instruction.
Reviewers:
arsenm, cdevadas
Differential Revision:
https://reviews.llvm.org/D147096
We set AddedComplexity = 100 for s_load patterns to prefer them over
global loads, but for s_buffer_load patterns there is no need to do
this and it was quietly overriding the AddedComplexity of each
individual GCNPat that is defined inside SMLoad_Pattern (but in practice
that did not appear to make any difference).
Differential Revision: https://reviews.llvm.org/D145396
However, the imported rules cannot be used for now because GlobalISel's
selectImpl() seems to have a bug/limitation that creates an illegal COPY
from VGPR to SGPR. For now, work around this by not auto-selecting these
patterns.
Fixes #61468
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147780
GlobalISel happens to insert some constant materializations before SI_END_CF
in one test. These need to be excluded from AliveBlocks since they
are defined in the original block and used in the split block,
so they aren't fully alive through either block.
The case where a value defined in the first block was originally used
in a later block is still broken.
Avoids a verifier error in a future patch.
This was causing two test failures when I applied D129208 to enable
extra verification of LiveIntervals:
LLVM :: CodeGen/AMDGPU/optimize-negated-cond-exec-masking-wave32.mir
LLVM :: CodeGen/AMDGPU/optimize-negated-cond-exec-masking.mir
Differential Revision: https://reviews.llvm.org/D147721
set_inactive is an operation that passes a value from active threads
to inactive threads. In a later WWM operation, threads that were
previously disabled and are now activated read the values passed to
them by the set_inactive operation. So I think set_inactive is a
convergent operation.
Differential Revision: https://reviews.llvm.org/D147683
All users of MCCodeEmitter::encodeInstruction use a raw_svector_ostream
to encode the instruction into a SmallVector. The raw_ostream however
incurs some overhead for the actual encoding.
This change allows an MCCodeEmitter to directly emit an instruction into
a SmallVector without using a raw_ostream, and therefore allows for
performance improvements in encoding. A default path that uses existing
raw_ostream implementations is provided.
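The shape of that design can be sketched with standard library types standing in for `SmallVector` and `raw_ostream`. The class and method names below (`CodeEmitter`, `encodeToStream`, `encodeToBuffer`) are hypothetical and are not the real MCCodeEmitter interface.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Toy instruction: a 32-bit encoding emitted little-endian.
struct Inst { uint32_t Encoding; };

class CodeEmitter {
public:
  virtual ~CodeEmitter() = default;

  // Legacy stream-based path (does nothing by default in this sketch).
  virtual void encodeToStream(const Inst &, std::ostream &) const {}

  // Buffer-based path. The default forwards to the stream-based path,
  // so existing emitters keep working unchanged.
  virtual void encodeToBuffer(const Inst &I, std::vector<char> &CB) const {
    std::ostringstream OS;
    encodeToStream(I, OS);
    const std::string Bytes = OS.str();
    CB.insert(CB.end(), Bytes.begin(), Bytes.end());
  }
};

// An emitter that only implements the stream path.
class StreamEmitter : public CodeEmitter {
public:
  void encodeToStream(const Inst &I, std::ostream &OS) const override {
    for (int Byte = 0; Byte < 4; ++Byte)
      OS.put(static_cast<char>((I.Encoding >> (8 * Byte)) & 0xff));
  }
};

// An emitter that writes bytes straight into the buffer.
class DirectEmitter : public CodeEmitter {
public:
  void encodeToBuffer(const Inst &I, std::vector<char> &CB) const override {
    for (int Byte = 0; Byte < 4; ++Byte)
      CB.push_back(static_cast<char>((I.Encoding >> (8 * Byte)) & 0xff));
  }
};
```

Callers always go through the buffer-based entry point; an emitter opts into the faster path by overriding it, and otherwise inherits the forwarding default.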
Reviewed By: MaskRay, Amir
Differential Revision: https://reviews.llvm.org/D145791
The inverse ballot intrinsic takes in a boolean mask for all lanes and
returns the boolean for the current lane. See SPIR-V's
`subgroupInverseBallot()` in the [[ https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_shader_subgroup.txt | GL_KHR_shader_subgroup extension ]].
This allows decision making via branch and select instructions with a manually
manipulated mask.
Implemented in GlobalISel and SelectionDAG, since currently both are supported.
The SelectionDAG required pseudo instructions to use the custom inserter.
The boolean mask needs to be uniform for all lanes.
Therefore we expect SGPR input. In case the source is in a
VGPR, we insert one or more `v_readfirstlane` instructions.
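As a scalar model of the semantics described above (not the intrinsic itself), the per-lane result is simply the lane's bit in the mask. The function name and the 64-bit mask width are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Returns the boolean for `Lane` from a mask that holds one bit per lane.
// The mask is the same (uniform) value for every lane.
inline bool inverseBallot(uint64_t Mask, unsigned Lane) {
  assert(Lane < 64 && "lane index out of range for a 64-bit mask");
  return ((Mask >> Lane) & 1u) != 0;
}
```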
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D146287
The pass has been disabled since approximately D104962 for miscompiling OpenMP.
The functions under ReplaceConstant miscompile phis as noted in D112717 and
have no users in tree other than the disabled pass. It seems likely it has no
users out of tree.
Deletes the test cases associated with the disabled pass as well.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D147586
The premise here is to allow non-kernel functions to locate external LDS variables without using LDS or extra magic SGPRs to do so.
1/ First it crawls the callgraph to work out which external LDS variables are reachable from a given kernel
2/ Then it creates a new `extern char[0]` variable for each kernel, which will alias all the other extern LDS variables because that's the documented behaviour of these variables
3/ The address of that variable is written to a lookup table. The global variable is tagged with metadata to track what address it was allocated at by codegen
4/ The assembler builds the lookup table using the metadata
5/ Any non-kernel functions use the same magic intrinsic used by table lookups of non-dynamic LDS variables to find the address to use
Heavy overlap with the code paths taken for other lowering, in particular the same intrinsic is used to pass the dynamic scope information through the same sgpr as for table lookups of static LDS.
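Conceptually, the lookup in steps 3 to 5 behaves like the toy model below: one table entry per kernel, indexed by an id that identifies the calling kernel. `DynLDSTable`, `dynamicLDSAddress`, and the integer kernel id are hypothetical names for this sketch; the real lowering uses an intrinsic and a table built by the assembler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: entry K holds the address at which kernel K's
// dynamic (extern) LDS variable was allocated.
struct DynLDSTable {
  std::vector<uint32_t> AddressByKernel;
};

// What a non-kernel function conceptually does: use the id of the
// kernel it was reached from to look up the dynamic LDS address.
inline uint32_t dynamicLDSAddress(const DynLDSTable &Table, unsigned KernelId) {
  assert(KernelId < Table.AddressByKernel.size() && "unknown kernel id");
  return Table.AddressByKernel[KernelId];
}
```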
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144233
There is no getNullValue in ConstantFP. Due to inheritance, we're calling
Constant::getNullValue which handles any type including FP.
Since we already know we want an FP constant we can use ConstantFP::getZero
which might be faster and is a more readable name for an FP zero.
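The name lookup behaviour can be shown with toy classes. These are stand-ins that only share names with the LLVM classes; the real functions take a type argument and return constants, not strings.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins, not the real LLVM classes.
struct Constant {
  static std::string getNullValue() { return "Constant::getNullValue"; }
};

struct ConstantFP : Constant {
  static std::string getZero() { return "ConstantFP::getZero"; }
};

// Spelling the call as ConstantFP::getNullValue still resolves to the
// base class member, because ConstantFP declares no such function.
inline std::string whichNullValue() { return ConstantFP::getNullValue(); }
```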
Similar to the existing SelectionDAG::SplitVector helper, this helper creates the EXTRACT_ELEMENT nodes for the LO/HI halves of the scalar source.
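On plain integers, the LO/HI split this helper performs corresponds to the sketch below. It assumes a 64-bit scalar split into two 32-bit halves; `splitScalar` here is an illustrative free function, not the SelectionDAG helper.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Returns {lo, hi}: the low and high 32-bit halves of a 64-bit scalar.
inline std::pair<uint32_t, uint32_t> splitScalar(uint64_t Value) {
  const uint32_t Lo = static_cast<uint32_t>(Value);
  const uint32_t Hi = static_cast<uint32_t>(Value >> 32);
  return {Lo, Hi};
}
```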
Differential Revision: https://reviews.llvm.org/D147264
Various Real classes took an OffsetMode parameter, but only used it
to extract the suffix for the name of the corresponding pseudo. I found
this confusing because you couldn't usefully define and use a different
OffsetMode here, e.g. one with different operand types to affect how the
instruction was printed.
Overall I think it's simpler to just pass in the suffixed pseudo name
directly.
Differential Revision: https://reviews.llvm.org/D147242
There should be no need to reserve all SGPR hi16/lo16 halves, or all
AGPR hi16 halves. This should be done by marking the corresponding
register classes as not allocatable instead.
Differential Revision: https://reviews.llvm.org/D147158
This reverts commit 64b45db34a0cd979dae9ca3016e9da517e57b987.
Reason: the patterns are wrong which can result in a miscompilation.
However, fixing the pattern is not trivial due to how i8 values
are handled, and due to the additional type-checking performed by
D147127: trunc/smax/smin are all defined as int ops in the DAG
despite them working on vectors too.
As this is not a much-needed pattern, I prefer reverting for now
until I can find time to properly rewrite the pattern.