Re-normalize branch-weights after removing a block successor to avoid
branch-weights not adding up to 100%. This changes MIR for the
`test/CodeGen/PowerPC/branch_coalesce.ll` test like this:
```diff
- successors: %bb.6(0x40000000); %bb.6(50.00%)
+ successors: %bb.6(0x80000000); %bb.6(100.00%)
```
This doesn't affect codegen on its own but fixing this helps with
fluctuations I have with some of my upcoming changes.
This patch adds the DAG isel changes for kernel argument preloading.
These changes are not usable with older firmware but subsequent patches
in the series will make the codegen backwards compatible. This patch
should only be submitted alongside that subsequent patch.
Preloading here begins from the start of the kernel arguments until the
amount of arguments indicated by the CL flag
amdgpu-kernarg-preload-count.
Aggregates and arguments passed by-ref are not supported.
Special care for the alignment of the kernarg segment is needed as well
as consideration of the alignment of addressable SGPR tuples when we
cannot directly use misaligned large tuples that the arguments are
loaded to.
Reviewed By: bcahoon
Differential Revision: https://reviews.llvm.org/D158579
Preloaded kernel arguments should not be lowered in the IR pass
AMDGPULowerKernelArguments. Therefore it's necessary to calculate the
total number of user SGPRs that are available for preloading and how
many SGPRs would be required to preload each argument to determine
whether we should skip lowering i.e. the argument will be preloaded
instead.
Reviewed By: bcahoon
Differential Revision: https://reviews.llvm.org/D156853
If LMUL is more than m1, we can be more aggressive about narrowing the
build_vector via a vsext if legal. If the narrow build_vector gets
lowered as a load, while both are linear in lmul, load uops are
generally more expensive than extend uops. If the narrow build_vector
gets lowered via dominant values, that work is linear in both #unique
elements and LMUL. So provided the number of unique values > 2, this is
a net win in work performed.
Teach the si-fix-sgpr-copies pass to deal with REG_SEQUENCE, PHI or
INSERT_SUBREG where the result is an SGPR, but some of the inputs are
constants materialized into VGPRs. This may happen in cases where for
instance several instructions use an immediate zero and SelectionDAG
chooses to put it in a VGPR to satisfy all of them. This however causes
the si-fix-sgpr-copies to try to switch the whole chain to VGPR and may
lead to illegal VGPR-to-SGPR copies. Rematerializing the constant into
an SGPR fixes the issue.
This patch adds a new code transformation to the `MachineSink` pass,
that tries to sink copies of an instruction, when the copies can be folded
into the addressing modes of load/store instructions, or
replace another instruction (currently, copies into a hard register).
The criteria for performing the transformation is that:
* the register pressure at the sink destination block must not
exceed the register pressure limits
* the latency and throughput of the load/store or the copy must not deteriorate
* the original instruction must be deleted
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D152828
Extra uses for variables outside the loop can mess with the generation
of postinc variables. This patch alters the collection of loop invariant
fixups in LSR when the target is optimizing for PostInc, to exclude the
collection of these extra uses. It is expected that the variable can be
rematerialized, which will lead to a more optimal sequence of
instructions in the loop.
Global ISel now selects `UMULL` and `UMULL2` instructions.
G_MUL instruction with input operands coming from `SEXT` or `ZEXT`
operations are turned into UMULL
G_MUL instructions with v2s64 result type is always scalarised except:
`mul ( unmerge( ext ), unmerge( ext ))`
So the extend could be unmerged and fold away the unmerge in the middle:
`mul ( unmerge( ext ), unmerge( ext ))` =>
`mul ( unmerge( merge( ext( unmerge )), unmerge( merge( ext( unmerge
))))` =>
`mul ( ext(unmerge)), ( ext( unmerge ))) `
Adds new extension SPV_KHR_expect_assume, new capability
ExpectAssumeKHR as well as the new instructions:
* OpExpectKHR
* OpAssumeTrueKHR
These are lowered from respectively llvm.expect.<ty> and llvm.assume
intrinsics.
Previously https://reviews.llvm.org/D157696
This is part of the work to address the D155472 regressions, there's a number of issues with generalizing this fold which is why I'm just adding SRL support atm.
Differential Revision: https://reviews.llvm.org/D159533
Pulled out of D159533, which encourages (zext (trunc x)) -> x folds, leading to more ISD::FSHR nodes, which was breaking some existing AMDGPUISD::PERM tests
Differential Revision: https://reviews.llvm.org/D159533
Reverts 6cb3866b1ce9d835402e414049478cea82427cf1.
Analysis of failures on buildbots with expensive checks enabled showed
that the problem was triggered by changes in another commit,
469b3bfad20550968ac428738eb1f8bb8ce3e96d, and was caused by the bug
addressed in #67245.
Previously, we where taking `CurVT` before finalizing `ToCast` which
meant potentially returning an `SDValue` with an illegal `ValueType`
for the operation.
Fix is to just take `CurVT` after we have finalized `ToCast` with
`PeekThroughCastsAndTrunc`.
This reverts commit 0def4e6b0f638b97a73bd4674365961d8fabda28, applies a
quick fix that disallows merging two pre-indexed loads, and adds MIR
regression tests.
Differential Revision: https://reviews.llvm.org/D152407
We need to use a COPY so the register coalescer can replace reads
of the register we copy to with X0. This is needed so that we use
X0 on instructions that don't have an immediate form.
This was reviewed as #67202.
I had to disable the late copy propagation pass that can see through
the ADDI we were previously emitting. We really want to get this
in the register coalescer if not even earlier.
This PR makes `-basic-block-sections` handle functions with custom
non-dot-text sections correctly. Cold parts of such functions must be
placed in the same section (not in `.text.split`) but with a unique id.
I'd guarded this case in D158874 to avoid regressions, and decided to go investigate what was going on. The solution turns out to be a generic splat matching extension to handle INSERT_SUBVECTOR. In theory, we could see these from other sources as well, but for some reason we only seem to see the i64 extract on rv32 case in practice. Not sure why that is to be honest.
Differential Revision: https://reviews.llvm.org/D159230
Add the command line option --no-trap-after-noreturn, which exposes the
pre-existing TargetOption `NoTrapAfterNoreturn`.
This pull request was split off from this one:
https://github.com/llvm/llvm-project/pull/65876
This patch introduces a new version for the basic block sections profile
as was requested in D158442, while keeping backward compatibility for
the old version.
The new encoding is as follows:
```
m <module_name>
f <function_name_1> <function_name_2>...
c <bb_id_1> <bb_id_2> <bb_id_3>
c <bb_id_4> <bb_id_5>
...
```
Module name specifier (starting with 'm') is optional and allows
distinguishing profiles for internal-linkage functions with the same
name. If not specified, profile will be applied to any function with the
same name.
Function name specifier (starting with 'f') can specify multiple
function name aliases.
Finally, basic block clusters are specified by 'c' and specify the
cluster of basic blocks, and the internal order in which they must be
placed in the same section.
The MI Peephole pass has grown to include a large number of transformations over the years. Many of the transformations require re-computation of kill flags but don't do a good job of re-computing them. This causes us to have very common failures when the compiler is built with expensive checks. Over time, we added and augmented a function that is supposed to go and fix up kill flags after each transformation but we keep missing cases.
This patch does the following:
- Removes the function to re-compute kill flags
- Adds LiveVariables to compute and maintain kill flags while transforming code
- Adds re-computation of kill flags for the post-RA peepholes for each block that contains a transformed instruction
Reviewed By: stefanp
Differential Revision: https://reviews.llvm.org/D133103
Unlike the per-lane mov*dup broadcast shuffles, broadcastsd/ss need port5 to splat across lanes
Found while reviewing a llvm-exegesis capture (and matches Agner + uops.info numbers) - I can't find any more easy wins from these captures so that will be it for now.
To simplify handling PAuth in the machine outliner, introduce a
separate AArch64PointerAuth pass that is executed after both
Prologue/Epilogue Inserter and Machine Outliner passes.
After moving to AArch64PointerAuth, signLR and authenticateLR are
not used outside of their class anymore, so make them private and
simplify accordingly.
The new pass is added via AArch64PassConfig::addPostBBSections(),
so that it can change the code size before branch relaxation occurs.
AArch64BranchTargets is placed there too, so it can take into account
any PACI(A|B)SP instructions and not excessively add BTIs at the start
of functions.
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D159357
This fills out some extra cases for sin/cos testing for various types under
Global ISel, which seem to all do OK. The existing tests in
sincospow-vector-expansion.ll can be removed, as they are now covered
elsewhere.
This is a followup to #66988. The implementation there did not
account for the possibility of the catch object frame index
referrring to a fixed object, which is the case on win64.
Noticed on D159533 and I've finally dealt with the x86 regressions - MatchingStackOffset wasn't peeking through AssertZext nodes while trying to find CopyFromReg/Load sources, it was only removing them if they were part of a (trunc (assertzext x)) pattern.
Reapplied after being reverted at 4389252c58b783ce5b - which should be addressed by D159537 / 6d2679992e58b
The existing fake True16 instructions using 32-bit VGPRs are supposed to
co-exist with real ones until all the necessary True16 functionality is
implemented and relevant tests are updated.
Reviewed By: arsenm, Joe_Nash
Differential Revision: https://reviews.llvm.org/D156101
The write to the SEH catch object happens before cleanuppads are
executed, while the first reference to the object will typically
be in a catchpad.
If we make use of first-use analysis, we may end up allocating
an alloca used inside the cleanuppad and the catch object at the
same stack offset, which would be incorrect.
https://reviews.llvm.org/D86673 was a previous attempt to fix it.
It used the heuristic "a slot loaded in a WinEH pad and never
written" to detect catch objects. However, because it checks
for more than one load (while probably more than zero was
intended), the fix does not actually work.
The general approach also seems dubious to me, so this patch
reverts that change entirely, and instead marks all catch object
slots as conservative (i.e. excluded from first-use analysis)
based on the WinEHFuncInfo. As far as I can tell we don't need
any heuristics here, we know exactly which slots are affected.
Fixes https://github.com/llvm/llvm-project/issues/66984.