When collecting trivially rematerializable defs, skip any subreg defs. We do not want to sink these.
Differential Revision: https://reviews.llvm.org/D121874
This reverts commit c46aab01c002b7a04135b8b7f1f52d8c9ae23a58.
This evidently blocks compiling in some cases that used to work
before. I'm also not fully convinced this is the correct place to fix
this problem.
This change replaces the manual selection of buffer_atomic_cmpswap*
instructions in SelectionDAG and GlobalISel with a tblgen based
selection in BUFInstructions.td. This allows us to select the return and
no-return variants in tblgen.
Differential Revision: https://reviews.llvm.org/D121770
The fp32 packed math instructions are introduced in gfx90a.
If their vector register operands are not properly aligned, the
verifier should flag them. Currently, the verifier failed to
report it and the compiler ended up emitting a broken assembly.
This patch fixes that missed case in TII::verifyInstruction.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D121794
This patch adds initial argmemonly inference, by checking the underlying
objects of locations returned by MemoryLocation.
I think this should cover most cases, except function calls to other
argmemonly functions.
I'm not sure if there's a reason why we don't infer those yet.
Additional argmemonly can improve codegen in some cases. It also makes
it easier to come up with a C reproducer for 7662d1687b09 (already fixed,
but I'm trying to see if C/C++ fuzzing could help to uncover similar
issues.)
Compile-time impact:
NewPM-O3: +0.01%
NewPM-ReleaseThinLTO: +0.03%
NewPM-ReleaseLTO+g: +0.05%
https://llvm-compile-time-tracker.com/compare.php?from=067c035012fc061ad6378458774ac2df117283c6&to=fe209d4aab5b593bd62d18c0876732ddcca1614d&stat=instructions
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D121415
With opaque pointers, we cannot use the pointer element type to
determine the LocationSize for the AA query. Instead, -aa-eval
tests are now required to have an explicit load or store for any
pointer they want to compute alias results for, and the load/store
types are used to determine the location size.
This may affect ordering of results, and sorting within one result,
as the type is not considered part of the sorted string anymore.
To somewhat minimize the churn, printing still uses faux typed
pointer notation.
NFC. Update script does not behave right since the run lines have
identical output. Delete the duplicated check prefix added in
22cfbf7ecacdf7db47c2f65fe896bdf62ebcc0f3
NFC. Hasn't been updated since the update script started adding
check-next.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D121719
This mainly changes the handling of bitcasts to not check the types
being casted from/to -- we should only care about the actual
load/store types. The GEP handling is also changed to not care about
types, and just make sure that we get an offset corresponding to
a vector element.
This was a bit of a struggle for me, because this code seems to be
pretty sensitive to small changes. The end result seems to produce
strictly better results for the existing test coverage though,
because we can now deal with more situations involving bitcasts.
Differential Revision: https://reviews.llvm.org/D121371
We have a pattern that undo sub x, c -> add x, -c canonicalization since c is more likely
an inline immediate than -c. This patch enables it to select scalar or vector subtracion by the input node divergence.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D121360
SDNodes with different target flags may now be folded together
rightfully resulting in the assertion in the refineAlignment.
Folding nodes with different target flags may result in the
wrong load instructions produced at least on the AMDGPU.
Fixes: SWDEV-326805
Differential Revision: https://reviews.llvm.org/D121335
Summary:
In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In current implementation,
user SGPRs are set up unnecessarily for some cases. If the target has aperture
registers, queue_ptr is not needed to reference aperture bases. For trap
handling, if target suppots getDoorbellID, queue_ptr is also not necessary.
Futher, code object version 5 introduces new kernel ABI which passes queue_ptr
as an implicit kernel argument, so user SGPRs are no longer necessary for
queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, we remove queue_ptr suport for llvm.debugtrap
in the backend.
Reviewers: sameerds, arsenm
Fixes: SWDEV-307189
Differential Revision: https://reviews.llvm.org/D119762
Flat can be merged with flat global since address cast is a no-op.
A combined memory operation needs to be promoted to flat.
Differential Revision: https://reviews.llvm.org/D120431
Add a new pass in the pre-ra AMDGPU scheduler to check if sinking trivially rematerializable defs that only has one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum amount of defs required to increase occupancy. Also re-schedule all regions that had occupancy matching the previous min occupancy using the new occupancy.
This is based off of the discussion in https://reviews.llvm.org/D117562.
The logic to determine the defs we should collect and determining if sinking would be beneficial is mostly the same. Main differences is that we are no longer limiting it to immediate defs and the def and use does not have to be part of a loop.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D119475
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.
But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.
This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.
And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm, ronlieb
Differential Revision: https://reviews.llvm.org/D114652
A load via pointer cast to constant will return true from
pointsToConstantMemory which is not necessarily so.
Fixes: SWDEV-326463
Differential Revision: https://reviews.llvm.org/D121172
Use TII::getRegClass to return a valid regclass or a nullptr
if the RC is unknown for a given OpIdx. This fixes a potential
crash occurred while getting the RC from a variadic instruction.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D120813
This test is to make sure the return address registers, if clobbered in the
function or when the function has calls, are save/restored irrespective of
whether the IPRA is enabled/disabled.
This test is found to be not save/restore the return address registers, when
clobbered in the function, with the corresponding downstream changes of D114652.
The test could not be reduced further as the register allocator needs enough
register pressure so that it allocates the return address registers as well.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D120922
It is not necessary to wait for all outstanding memory operations before
barriers on hardware that can back off of the barrier in the event of an
exception when traps are enabled. Add a new subtarget feature which
tracks which HW has this ability.
Reviewed By: #amdgpu, rampitec
Differential Revision: https://reviews.llvm.org/D120544
SIInstrInfo::FoldImmediate tried to delete move-immediate instructions
after folding them into their only use. This did not work because it was
checking hasOneNonDBGUse after doing the fold, at which point there
should be no uses. This seems to have no effect on codegen, it just
means less stuff for DCE to clean up later.
Differential Revision: https://reviews.llvm.org/D120815
In convertToThreeAddress handle VOP2 mac/fmac instructions with a
literal src0 operand, since these are prime candidates for
converting to madmk/fmamk.
Previously this would only happen if src0 (or src1) was a register
defined by a move-immediate instruction, but in many cases these
operands have already been folded because SIFoldOperands runs
before TwoAddressInstructionPass.
Differential Revision: https://reviews.llvm.org/D120736
This change adds the selection of no-return buffer_* instructions in
tblgen. The motivation for this is to get the no-return atomic isel
working without relying on post-isel hooks so that GlobalISel can start
selecting them (once GlobalISelEmitter allows no return atomic patterns
like how DAGISel does).
This change handles the selection of no-return mubuf_atomic_cmpswap in
tblgen without changing the extract_subreg generation for the return
variant. This handling was done by the post-isel hook.
Differential Revision: https://reviews.llvm.org/D120538