This avoids regressions in a future AMDGPU commit. Previously we
would have a build_vector (extract_vector_elt x), undef with free
access to the elements bloated into a shuffle of one element + undef,
which has much worse combine support than the extract.
Alternatively could check aggressivelyPreferBuildVectorSources, but
I'm not sure it's really different than isExtractVecEltCheap.
Only select to a VGPR if it's trivally used in VGPR only contexts.
This fixes mishandling frame indexes used in SGPR only contexts,
like inline assembly constraints.
This is suboptimal in the common case where the frame index
is transitively used by only VALU ops. We make up for this by later
folding the copy to VALU plus scalar op in SIFoldOperands.
Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There are
times when regalloc reuses the registers of regular
VGPRs operations for wwm-operations in a small range
leading to unwantedly clobbering their inactive lanes
causing correctness issues that are hard to trace.
This patch splits the VGPR allocation pipeline further
to allocate wwm-registers first and the regular VGPR
operands in a separate pipeline. The splitting would
ensure that the physical registers used for wwm
allocations won't take part in the next allocation
pipeline to avoid any such clobbering.
This is still relying on the manual code for splitting 64-bit
constants, and handling pointers.
We were missing some of the tablegen patterns for all immediate types,
so this has some side effect DAG path improvements. This also reduces
the diff in the 2 selector outputs.
This reverts commit adaff46d087799072438dd744b038e6fd50a2d78.
Drop the -O3 checks from default-attributes.hip. I don't know why they
are different on some bots but reverting this is far too disruptive.
Removing it from the codegen pipeline induces a lot of test churn
because llc is no longer optimizing out implicit arguments to kernels.
Mostly mechanical, but there are some creative test updates. I preferred
to take the changes as-is in tests where the ABI isn't relevant. In
cases where it's more relevant, or the optimize out logic was too
ingrained in the test, I pre-run the optimization. Some cases manually
add attributes to disable inputs.
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.
[1]
34c8b835b1
Similar to 806761a7629df268c8aed49657aeccffa6bca449.
For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.
This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:
```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
The motivation for this change is a workload generated by the XLA compiler
targeting nvidia GPUs.
This kernel has a few hundred i8 loads and stores. Merging is critical for
performance.
The current LSV doesn't merge these well because it only considers instructions
within a block of 64 loads+stores. This limit is necessary to contain the
O(n^2) behavior of the pass. I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.
So we rewrite basically the whole thing to use a new algorithm. Before, we
compared every load/store to every other to see if they're consecutive. The
insight (from tra@) is that this is redundant. If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.
So that's what we do. When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain. Each instruction gets compared only to the leaders of
all the chains.
In the worst case, this is still O(n^2), because all chains might be of length
1. To prevent compile time blowup, we only consider the 64 most recently used
chains. Thus we do no more comparisons than before, but we have the potential
to make much longer chains.
This rewrite affects many tests. The changes to tests fall into two
categories.
1. The old code had what appears to be a bug when deciding whether a misaligned
vectorized load is fast. Suppose TTI reports that load <i32 x 4> align 4
has relative speed 1, and suppose that load i32 align 4 has relative speed
32.
The intent of the code seems to be that we prefer the scalar load, because
it's faster. But the old code would choose the vectorized load.
accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not
even call into TTI to get the relative speed), because the scalar load is
aligned.
After this patch, we will prefer the scalar load if it's faster.
2. This patch changes the logic for how we vectorize. Usually this results in
vectorizing more.
Explanation of changes to tests:
- AMDGPU/adjust-alloca-alignment.ll: #1
- AMDGPU/flat_atomic.ll: #2, we vectorize more.
- AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this. Before, we'd vectorize in case 1 and not case 2. Now we vectorize in case 2 and not case 1. So we just move the call.
- AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
- AMDGPU/insertion-point.ll: #2 we vectorize more
- AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
- AMDGPU/multiple_tails.ll: #1
- AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
- AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
- NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
- NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
- X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
- X86/subchain-interleaved.ll: #2, we vectorize more
- X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
- X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical. It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
- X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll
Differential Revision: https://reviews.llvm.org/D149893
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.
getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.
At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).
This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.
Differential Revision: https://reviews.llvm.org/D139468
A bitcast of <10 x i32> to <5 x i64> was ending up on the
stack. Instead of doing that, handle the case where the new type
doesn't evenly divide but the elements do. Extract the individual
elements and pad with undef.
Avoids stack usage for bitcasts involving <5 x i64>. In some of these
cases, later optimizations actually eliminated the stack objects but
left behind the unused temporary stack object to final emission.
Fixes: SWDEV-377548
This was disabled to prevent regressions, which appear to be just occurring on AMDGPU (at least in our current lit tests), which I've addressed by adding AMDGPUTargetLowering::isDesirableToCommuteWithShift overrides.
Fixes#57872
Differential Revision: https://reviews.llvm.org/D136042
The bitmask used to extract the bits assumed 16 bit elements and wasn't taking the size of the elements into account.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D135156
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133593
Only do this for 16 and 32 register tuples, although we might want to
extend to 8 tuples.
It's incredibly expensive to spill these, and doing so majorly
interferes with the ability to allocate anything else in the function.
The lit tests show mostly sizeable improvements with a handful of tiny
regressions with large vectors.
This patch improves the codegen of extractelement and insertelement for vector
containing 8 elements. Before, a dag combine transformation was generating a
sequence of 8 select/cmp.
This patch changes the upper limit for this transformation and the movrel
instruction will eventually be used instead. Extractlement/insertelement for
vectors containing less than 8 elements are unchanged.
Differential Revision: https://reviews.llvm.org/D126389
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D116686
Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2
This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62.
The change contained many unrelated changes and e.g. restored
unit test failes for the old lld port.
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ibf8e397df94001f248fba609f072088a46abae08
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D115960
Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary
results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
Description: This change enables the compare operations to be selected to SALU/VALU form
dependent of the SDNode divergence flag.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D106079
This tends to increase code size but more importantly it reduces vgpr
usage, and could avoid costly readfirstlanes if the result needs to be
in an sgpr.
Differential Revision: https://reviews.llvm.org/D88245
Summary:
Add patterns to select s_cselect in the isel.
Handle more cases of implicit SCC accesses in si-fix-sgpr-copies
to allow new patterns to work.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits
Tags: #llvm
Re-commit D81925 with a bugfix D82370.
Differential Revision: https://reviews.llvm.org/D81925
Differential Revision: https://reviews.llvm.org/D82370
Summary:
Add patterns to select s_cselect in the isel.
Handle more cases of implicit SCC accesses in si-fix-sgpr-copies
to allow new patterns to work.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81925
It was set in total vector size while the idea was to limit
a number of instructions. Now it started to work with doubles
and thresholds needs to be updated.
Differential Revision: https://reviews.llvm.org/D80322
Even though series of cmd/cndmask can produce quite a lot of
code that is still better than a loop. In case of doubles we
would even produce two loops.
Differential Revision: https://reviews.llvm.org/D80032
SelectMOVRELOffset prevents peeling of a constant from an index
if final base could be negative. isBaseWithConstantOffset() succeeds
if a value is an "add" or "or" operator. In case of "or" it shall
be an add-like "or" which never changes a sign of the sum given a
non-negative offset. I.e. we can safely allow peeling if operator is
an "or".
Differential Revision: https://reviews.llvm.org/D79898
We can produce such vectors in the Promote Alloca pass,
but we are unable to use movrel to operate it and lower
via scratch. Making it legal makes SI_INDIRECT patterns
work.
There is more work to do in subsequent changes:
1. We initialize m0 twice to access each dword. It shall
be possible to only do it once and increment base register
number instead.
2. We also need v16i64/v16f64 but these first need to be
added to tablegen.
Differential Revision: https://reviews.llvm.org/D79808
Summary:
Incorrect code was generated when lowering insertelement operations
for vectors with 8 or 16 bit elements. The value being inserted was
not adjusted for the position of the element within the 32 bit word
and so only the low element within each 32 bit word could receive
the intended value.
Fixed by simply replicating the value to each element of a
congruent vector before the mask and or operation used to
update the intended element.
A number of affected LIT tests have been updated appropriately.
before the mask & or into the intended
Reviewers: arsenm, nhaehnle
Reviewed By: arsenm
Subscribers: llvm-commits, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57588
llvm-svn: 352885
This allows to avoid scratch use or indirect VGPR addressing for
small vectors.
Differential Revision: https://reviews.llvm.org/D54606
llvm-svn: 347231