GlobalISel already handles undefined workitem.id.{x,y,z} intrinsics,
SelDAG failed in AMDGPUISelLowering.cpp due to a failed assertion in
`AMDGPUTargetLowering::loadInputValue`: `Arg && "Attempting to load
missing argument"`. This commit changes the behavior of SelDAG to
instead use a zero constant.
This LLVM defect was identified via the AMD Fuzzing project.
`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.
Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
---------
Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
The lowering for GEP didn't properly support the case where the pointer
argument was being implicitly broadcast by a vector of indices. Fix
that.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The demonte-scc transformation is no longer needed and the old test name
doesn't make sense anymore.
The test checks the generated assembly for different branch cases
* without metadata,
* with the same branch_weights on each edge and
* with a branch_weights that corresponds to the [[likely]] attribute
This was ultimately working around bugs in subregister handling
in peephole-opt. In the common case, it would give up on folding
anything into a subregister extract copy.
Remove the restriction that scheduling rematerialization candidates
cannot have virtual reg uses.
Currently, this only allows for virtual reg uses which are already live
at the rematerialization point, so bring in allUsesAvailableAt to check
for this condition. Because of this condition, the uses of the remats
will already be live in to the region, so the remat won't increase
live-in pressure.
Add an expensive check to check this condition.
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10 year old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subrgs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other pasess like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
In contrast to SelectionDAG, GlobalISel created a new virtual register
for the return value of invariant.start, leaving subsequent users of the
invariant.start value with an undefined reference.
A minimal example:
```
%tmp = alloca i32, align 4, addrspace(5)
%tmpI = call ptr @llvm.invariant.start.p5(i64 4, ptr addrspace(5) %tmp) #3
call void @llvm.invariant.end.p5(ptr %tmpI, i64 4, ptr addrspace(5) %tmp) #3
store i32 %i, ptr %tmpI, align 4
```
Although the return value of invariant.start might not be intended for
any use beyond invariant.end (the fuzzer might not have created a
sensible situation here), an implicit definition of the corresponding
virtual register avoids a segfault in the target instruction selector
later.
This LLVM defect was identified via the AMD Fuzzing project.
True16 codegen for FPtoi1.
It seems tablegen figured out the pattern even without this pat in
place, and the fptoui/fptosi.ll already got the right transformation.
Aditionally updated the mir file and split it to pre-gfx11 and
post-gfx11.
Fix introducing stack usage if a bitcast source operand is an illegal
integer type cast to a legal vector type. This should cover more
situations, but this is the first one I noticed.
These special cases limit the width of memory operations we use for
lowering memcpy/memmove when the pointer arguments are 2-aligned or in
the LDS/GDS.
I found that performance in microbenchmarks on gfx90a, gfx1030, and
gfx1100 is better without this limitation.
Previously we only handled cases that looked like the high element
extract of a 64-bit shift. Generalize this to handle any multiple
indexing. I was hoping this would help avoid some regressions,
but it did not. It does however reduce the number of steps the DAG
takes to process these cases.
NFC-ish, I have yet to find an example where this changes the
final output.
In ceratin situations it is beneficial to wait for all outstanding
loads regardless of specific load's data we need. This may allow
to reduce a number of cache requests.
Fixes: SWDEV-511507
Oversight found by ISel fuzz effort. Assuming the argument is a
register, in some cases it can be an immediate. Tablegen's type for the
instruction is SSrc_b32, i.e. register or immediate fine. Added the
repro from the bug reporter as a test case - prior to this patch llvm
will assert in getReg.
Fixes SWDEV-508589
Given the rest of the pass just gives up when it needs to compose
subregisters, folding a subregister extract directly into a reg_sequence
is counterproductive. Later fold attempts in the function will give up
on the subregister operand, preventing looking up through the reg_sequence.
It may still be profitable to do these folds if we start handling
the composes. There are some test regressions, but this mostly
looks better.
Previously this combine would undo AMDGPU's new custom legalization of
wide vector shuffles into 2 element pieces. The comment also
states that this combine is only done before legalization,
but the case with a build_vector source was unconditional.
We probably don't want to do this if the multiple uses are full
scalarization of the vector, but this seems to work well enough.
Scalarizing extracts should have folded out pre-legalize.
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen-back
the function is empty and it incorrectly gets cached to false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.