The motivation for this change is a workload generated by the XLA compiler
targeting nvidia GPUs.
This kernel has a few hundred i8 loads and stores. Merging is critical for
performance.
The current LSV doesn't merge these well because it only considers instructions
within a block of 64 loads+stores. This limit is necessary to contain the
O(n^2) behavior of the pass. I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.
So we rewrite basically the whole thing to use a new algorithm. Before, we
compared every load/store to every other to see if they're consecutive. The
insight (from tra@) is that this is redundant. If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.
So that's what we do. When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain. Each instruction gets compared only to the leaders of
all the chains.
In the worst case, this is still O(n^2), because all chains might be of length
1. To prevent compile time blowup, we only consider the 64 most recently used
chains. Thus we do no more comparisons than before, but we have the potential
to make much longer chains.
This rewrite affects many tests. The changes to tests fall into two
categories.
1. The old code had what appears to be a bug when deciding whether a misaligned
vectorized load is fast. Suppose TTI reports that load <i32 x 4> align 4
has relative speed 1, and suppose that load i32 align 4 has relative speed
32.
The intent of the code seems to be that we prefer the scalar load, because
it's faster. But the old code would choose the vectorized load.
accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not
even call into TTI to get the relative speed), because the scalar load is
aligned.
After this patch, we will prefer the scalar load if it's faster.
2. This patch changes the logic for how we vectorize. Usually this results in
vectorizing more.
Explanation of changes to tests:
- AMDGPU/adjust-alloca-alignment.ll: #1
- AMDGPU/flat_atomic.ll: #2, we vectorize more.
- AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this. Before, we'd vectorize in case 1 and not case 2. Now we vectorize in case 2 and not case 1. So we just move the call.
- AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
- AMDGPU/insertion-point.ll: #2 we vectorize more
- AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117d476, which appear to have hit the bug from #1)
- AMDGPU/multiple_tails.ll: #1
- AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
- AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
- NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
- NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
- X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
- X86/subchain-interleaved.ll: #2, we vectorize more
- X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
- X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical. It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
- X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll
Differential Revision: https://reviews.llvm.org/D149893
In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).
This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.
Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147786
DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen.
This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.
This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users.
It's also only enabled for DAGIsel and GlobalISel handles PHIs much better (as it works on the whole function).
This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.
Fixes SWDEV-321581
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D143731
vectorize-buffer-fat-pointer.ll required a manual check line fix.
vector-alloca-addrspacecast.ll required a manual fixup of a check
line. partial-regcopy-and-spill-missed-at-regalloc.ll required
re-running update_mir_test_checks. The HSA metadata tests required
avoiding the script touching the type name in the metadata.
annotate-noclobber.ll ran into one update script bug. It deleted a
check line with a 0 offset GEP, moving the following -NEXT check
logically up one line.
This was disabled to prevent regressions, which appear to be just occurring on AMDGPU (at least in our current lit tests), which I've addressed by adding AMDGPUTargetLowering::isDesirableToCommuteWithShift overrides.
Fixes#57872
Differential Revision: https://reviews.llvm.org/D136042
The bitmask used to extract the bits assumed 16 bit elements and wasn't taking the size of the elements into account.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D135156
Surprisingly these were getting legalized to something
zero initialized.
This fixes an infinite loop when combining some vector types.
Also fixes zero initializing some undef values.
SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking
for the legality of the output undefs they are replacing unused
operations with. This resulted in turning vectors into undefs
that were later re-legalized back into zero vectors.
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133593
This patch improves the codegen of extractelement and insertelement for vector
containing 8 elements. Before, a dag combine transformation was generating a
sequence of 8 select/cmp.
This patch changes the upper limit for this transformation and the movrel
instruction will eventually be used instead. Extractlement/insertelement for
vectors containing less than 8 elements are unchanged.
Differential Revision: https://reviews.llvm.org/D126389
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
In some cases StructurizeCFG inserts i1 xor instructions to invert
predicates. Add a quick loop to clean these up afterwards if we can get
away with modifying an existing compare instruction instead.
(StructurizeCFG is generally run late in the pipeline so instcombine
does not clean them up for us.)
Differential Revision: https://reviews.llvm.org/D118623
This reverts commit a6b54ddaba2d5dc0f72dcc4591c92b9544eb0016.
Apparently it is not safe to modify the condition even if it passes the
hasOneUse test, because StructurizeCFG might have other references to
the condition that are not manifest in the IR use-def chains.
This avoids various cases where StructurizeCFG would otherwise insert an
xor i1 instruction, and it since it generally runs late in the pipeline,
instcombine does not clean up the xor-of-cmp pattern.
Differential Revision: https://reviews.llvm.org/D118478
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D116686
Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2
This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62.
The change contained many unrelated changes and e.g. restored
unit test failes for the old lld port.
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ibf8e397df94001f248fba609f072088a46abae08
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D115960
Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retreaved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary
results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
The compiler was generating symbols in the final code object for local
branch target labels. This bloats the code object, slows down the loader,
and is only used to simplify disassembly.
Use '--symbolize-operands' with llvm-objdump to improve readability of the
branch target operands in disassembly.
Fixes: SWDEV-312223
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D114273
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
Description: This change enables the compare operations to be selected to SALU/VALU form
dependent of the SDNode divergence flag.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D106079
The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently added in
1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation
failure when many wide tuple spills are introduced. I think this is a
workaround since I would not expect the allocation priority to be
required, and only a performance hint. The allocator should be smarter
about when only a subregister needs to be spilled and restored.
This does regress a couple of degenerate store stress lit tests which
shouldn't be too important.
Using REG_SEQUENCE produces better code than INSERT_SUBREG,
we can omit one move instruction in many cases.
Fixes: SWDEV-298028
Differential Revision: https://reviews.llvm.org/D107602
This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.
This gives 10-20% uplift in a set of huge apps heavily using double
precession math.
Fixes: SWDEV-292645
Differential Revision: https://reviews.llvm.org/D104874
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D104622
This tends to increase code size but more importantly it reduces vgpr
usage, and could avoid costly readfirstlanes if the result needs to be
in an sgpr.
Differential Revision: https://reviews.llvm.org/D88245
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
When memory operations are outstanding on function calls, either the
caller or the callee can insert a waitcnt to ensure that all reads are
finished.
Calls need some time to be executed, so if the callee inserts the
waitcnt, filling the instruction buffer and waiting for memory will be
interleaved, hiding some latency. This comes at the cost of having a
waitcnt inside functions that may not be needed as no memory operations
are outstanding.
For function calls, this is already implemented. The same principal
applies to returns: If the caller inserts a waitcnt after the call, the
callee does not have to wait and the return and memory operation can be
run in parallel.
This commit implements waiting in the caller after returning from a
function call.
Differential Revision: https://reviews.llvm.org/D87674
Clustering loads has caching benefits, but as far as I know there is no
advantage to clustering stores on any AMDGPU subtargets.
The disadvantage is that it tends to increase register pressure and
restricts scheduling freedom.
Differential Revision: https://reviews.llvm.org/D85530
Unfortunately this ends up not working as expected on targets with
16-bit operations due to AMDGPUCodeGenPrepare's promotion of uniform
16-bit ops to i32.
The vector case annoyingly requires switching the checked opcode,
since constants for vectors aren't directly handled.
I also need to think more carefully about whether this is valid for i1.
Fix 64-bit copy to SCC by restricting the pattern resulting
in such a copy to subtargets supporting 64-bit scalar compare,
and mapping the copy to S_CMP_LG_U64.
Before introducing the S_CSELECT pattern with explicit SCC
(0045786f146e78afee49eee053dc29ebc842fee1), there was no need
for handling 64-bit copy to SCC ($scc = COPY sreg_64).
The proposed handling to read only the low bits was however
based on a false premise that it is only one bit that matters,
while in fact the copy source might be a vector of booleans and
all bits need to be considered.
The practical problem of mapping the 64-bit copy to SCC is that
the natural instruction to use (S_CMP_LG_U64) is not available
on old hardware. Fix it by restricting the problematic pattern
to subtargets supporting the instruction (hasScalarCompareEq64).
Differential Revision: https://reviews.llvm.org/D85207
Add handling of s_andn2 and mask of 0.
This eliminates redundant instructions from uniform control flow.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D83641
Summary:
Add patterns to select s_cselect in the isel.
Handle more cases of implicit SCC accesses in si-fix-sgpr-copies
to allow new patterns to work.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits
Tags: #llvm
Re-commit D81925 with a bugfix D82370.
Differential Revision: https://reviews.llvm.org/D81925
Differential Revision: https://reviews.llvm.org/D82370
Summary:
Make use of both the - (1) clustered bytes and (2) cluster length, to decide on
the max number of mem ops that can be clustered. On an average, when loads
are dword or smaller, consider `5` as max threshold, otherwise `4`. This
heuristic is purely based on different experimentation conducted, and there is
no analytical logic here.
Reviewers: foad, rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: llvm-commits, kerbowa, hiraditya, t-tye, Anastasia, tpr, dstuttard, yaxunl, nhaehnle, wdng, jvesely, kzhuravl, thakis
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82393
Summary:
Add patterns to select s_cselect in the isel.
Handle more cases of implicit SCC accesses in si-fix-sgpr-copies
to allow new patterns to work.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81925
Summary:
Make use of both the - (1) clustered bytes and (2) cluster length, to decide on
the max number of mem ops that can be clustered. On an average, when loads
are dword or smaller, consider `5` as max threshold, otherwise `4`. This heuristic
is purely based on different experimentation conducted, and there is no analytical
logic here.
Reviewers: foad, rampitec, arsenm, vpykhtin
Reviewed By: foad, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, Anastasia, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81085
SelectMOVRELOffset prevents peeling of a constant from an index
if final base could be negative. isBaseWithConstantOffset() succeeds
if a value is an "add" or "or" operator. In case of "or" it shall
be an add-like "or" which never changes a sign of the sum given a
non-negative offset. I.e. we can safely allow peeling if operator is
an "or".
Differential Revision: https://reviews.llvm.org/D79898
We can produce such vectors in the Promote Alloca pass,
but we are unable to use movrel to operate it and lower
via scratch. Making it legal makes SI_INDIRECT patterns
work.
There is more work to do in subsequent changes:
1. We initialize m0 twice to access each dword. It shall
be possible to only do it once and increment base register
number instead.
2. We also need v16i64/v16f64 but these first need to be
added to tablegen.
Differential Revision: https://reviews.llvm.org/D79808