After performing sink-and-fold over a COPY, the original instruction is
replaced with one that produces its output in the destination of the
copy. Its value is still available (in a hard register), so if there are
debug instructions which refer to the (now deleted) virtual register
they could be updated to refer to the hard register, in principle.
However, it's not clear how to do that. Moreover, in some cases the
debug instructions may need to be replicated in proportion to the number
of COPY instructions replaced, and in extreme cases we can end up with a
quadratic increase in the number of debug instructions, e.g.:
    int f(int);
    void g(int x) {
      int y = x + 1;
      int t0 = y;
      f(t0);
      int t1 = y;
      f(t1);
    }
This reverts commit 8ee07a4be7f7d8654ecf25e7ce0a680975649544.
The revert is breaking AMDGPU backend tests (which I didn't have
enabled), and I don't want to risk breakages over the weekend, so just
revert for now.
This reverts commit 82f68a992b9f89036042d57a5f6345cb2925b2c1.
cd7ba9f3d090afb5d3b15b0dcf379d15d1e11e33 needs to be reverted to fix
test failures on builds without assertions, and this one needs to be
reverted first for that.
Summary:
The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY
sections to call all the global constructors in a single kernel.
Previously this mistakenly used the same iteration logic for both
arrays. The destructors stored in FINI_ARRAY are stored in the same
order as the ones in the INIT_ARRAY section so we need to traverse it in
reverse order.
Relanding after the revert in fe7b5e2cfcf6848287010291081f85fa1f6bb2ef
using the IR builder interface instead of ConstantExpr.
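The traversal order described above can be sketched in plain C++. This
is only an illustration, not the IR the backend emits: the in-memory
arrays and the ctorA/ctorB/dtorA/dtorB callbacks are hypothetical
stand-ins for the linker-provided sections.

```cpp
#include <cassert>
#include <vector>

using Callback = void (*)();

// Records the order in which the hypothetical callbacks run.
static std::vector<int> Log;
static void ctorA() { Log.push_back(1); }
static void ctorB() { Log.push_back(2); }
static void dtorA() { Log.push_back(-1); }
static void dtorB() { Log.push_back(-2); }

// Constructors run front to back.
void runInit(Callback *Begin, Callback *End) {
  for (Callback *I = Begin; I != End; ++I)
    (*I)();
}

// Destructors are stored in the same order as the constructors, so the
// array is walked back to front.
void runFini(Callback *Begin, Callback *End) {
  for (Callback *I = End; I != Begin;)
    (*--I)();
}

// Runs both arrays and returns the observed call order.
std::vector<int> demo() {
  Callback Init[] = {ctorA, ctorB};
  Callback Fini[] = {dtorA, dtorB};
  Log.clear();
  runInit(Init, Init + 2);
  runFini(Fini, Fini + 2);
  assert(Log.size() == 4);
  return Log;
}
```

With the reverse walk, the destructor matching the last constructor runs
first.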
Summary:
The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY
sections to call all the global constructors in a single kernel.
Previously this mistakenly used the same iteration logic for both
arrays. The destructors stored in FINI_ARRAY are stored in the same
order as the ones in the INIT_ARRAY section so we need to traverse it in
reverse order.
Summary:
This pass emits the new "nvptx$device$init" and "nvptx$device$fini"
kernels that are callable by the device. This intends to mimic the
method of lowering for AMDGPU where we emit `amdgcn.device.init` and
`amdgcn.device.fini` respectively. These kernels simply iterate over the
arrays delimited by the `__init_array_start/stop` and
`__fini_array_start/stop` symbols. Normally, the linker provides these
symbols automatically. In the AMDGPU case we only need to call the
kernel, which then calls the ctors / dtors. However, for NVPTX we require
that the user initialize these variables to the associated globals that
we already emit as a part of this pass.
The motivation behind this change is to move away from OpenMP's handling
of ctors / dtors. I would much prefer that the backend / runtime handles
this. That allows us to handle ctors / dtors in a language-agnostic way.
This approach requires that the runtime initializes the associated
globals. They are marked `weak` so we can emit this per-TU. The kernel
itself is `weak_odr` as it is copied exactly.
One downside is that any module containing these kernels elicits the
"stack size cannot be statically determined" warning every time from
`nvlink`, which is annoying but inconsequential for functionality. It
would be nice if there were a way to silence this warning, however.
Similar to #70635, this expands the handling of integer to fp
conversions. The code is very similar to the float->integer conversions
with types handled oppositely. There are some extra unhandled cases
which require more handling for ASR operations.
This patch lowers `sdiv x, +/-2**k` to `add + select + shift` when the
short forward branch optimization is enabled. The latter inst seq
performs faster than the seq generated by target-independent
DAGCombiner. This algorithm is described in ***Hacker's Delight***.
This patch also removes duplicate logic in the X86 and AArch64 backends.
But we cannot do this for the PowerPC backend since it generates a
special instruction `addze`.
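As a rough illustration of the `add + select + shift` sequence, here is
a scalar C++ sketch. The function name sdivPow2 and its parameters are
made up for this example, and the computation is widened to 64 bits only
to keep the C++ free of signed overflow. It assumes `>>` on a negative
value is an arithmetic shift (guaranteed since C++20, and true on
mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Signed division of X by +/-2**K for 1 <= K <= 30, rounding toward zero.
// sdivPow2 is a hypothetical helper, not a function from the patch.
int32_t sdivPow2(int32_t X, unsigned K, bool NegativeDivisor) {
  assert(K >= 1 && K <= 30);
  int64_t Bias = (int64_t(1) << K) - 1;
  int64_t Biased = int64_t(X) + Bias; // add
  int64_t Sel = X < 0 ? Biased : X;   // select: bias only negative inputs
  int64_t Quot = Sel >> K;            // arithmetic shift
  // For a divisor of -2**K the quotient is negated afterwards.
  return int32_t(NegativeDivisor ? -Quot : Quot);
}
```

The bias turns the shift's round-toward-negative-infinity into the
round-toward-zero that `sdiv` requires.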
Fixed:
1. Maximum register pressure calculation at the instruction level.
Previously max RP included both def and use of registers of an
instruction. Now maximum RP includes _uses_ and _early-clobber defs_.
2. Uses were incorrectly tracked and this resulted in a mismatch of
live-in set reported by LiveIntervals and tracked live reg set when the
beginning of the block is reached.
The interface has changed: moveMaxPressure is deprecated and the
getMaxPressure and resetMaxPressure functions are added. The reset
function now seems more consistent.
si-wqm sometimes needs to save the LiveMask in the entry block. Later
on, while looking for a place to enter WQM/WWM, it unconditionally
skips over the first COPY instruction in the entry block. This is
incorrect for functions where the LiveMask doesn't need to be saved, and
therefore the first COPY is more likely a COPY from a function argument
and might need to be in some non-exact mode.
This patch fixes the issue by also checking that the source of the COPY
is the EXEC register.
This produces different code in 3 of the existing tests:
In wwm-reserved.ll, a SGPR copy is now inside the WWM area rather than
outside. This is benign.
In wave32.ll, we end up with an extra register copy. This is because
the first COPY in the block is now part of the WWM block, so
si-pre-allocate-wwm-regs will allocate a new register for its
destination (when it was outside of the WWM region, the register
allocator could just re-use the same register). We might be able to
improve this in si-pre-allocate-wwm-regs but I haven't looked into it.
The same thing happens in dual-source-blend-export.ll, but for that
one it's harder to see because of the scheduling changes. I've uploaded
the before/after si-wqm output for it here:
https://reviews.llvm.org/differential/diff/553445/
Differential Revision: https://reviews.llvm.org/D158841
We only have instructions for OEQ, OLT, and OLE. We need to convert
other comparison codes into those.
I think we'll likely want to split this up in the future to support
optimizations. Maybe do some of it in the legalizer or in a new post
legalizer lowering pass. So this patch is just enough to get something
working without adding 11 additional patterns to tablegen for each type.
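One possible set of expansions for the remaining 11 comparison codes,
written as scalar C++ for illustration. These are the standard
identities and not necessarily the exact expansions this patch uses. The
fcmp* names are made up for the example, and the code assumes IEEE
semantics for `==`, `<` and `<=` (no fast-math).

```cpp
#include <cassert>
#include <limits>

// The three primitives the target is said to provide.
static bool fcmpOEQ(double A, double B) { return A == B; }
static bool fcmpOLT(double A, double B) { return A < B; }
static bool fcmpOLE(double A, double B) { return A <= B; }

// Ordered codes: swap operands or combine two primitives.
bool fcmpOGT(double A, double B) { return fcmpOLT(B, A); }
bool fcmpOGE(double A, double B) { return fcmpOLE(B, A); }
bool fcmpONE(double A, double B) { return fcmpOLT(A, B) || fcmpOLT(B, A); }
bool fcmpORD(double A, double B) { return fcmpOEQ(A, A) && fcmpOEQ(B, B); }

// Unordered codes: invert the complementary ordered code.
bool fcmpUNO(double A, double B) { return !fcmpORD(A, B); }
bool fcmpUEQ(double A, double B) { return !fcmpONE(A, B); }
bool fcmpUGT(double A, double B) { return !fcmpOLE(A, B); }
bool fcmpUGE(double A, double B) { return !fcmpOLT(A, B); }
bool fcmpULT(double A, double B) { return !fcmpOLE(B, A); }
bool fcmpULE(double A, double B) { return !fcmpOLT(B, A); }
bool fcmpUNE(double A, double B) { return !fcmpOEQ(A, B); }

// Small sanity check of the NaN behaviour.
void checkNaNBehaviour() {
  const double NaN = std::numeric_limits<double>::quiet_NaN();
  assert(!fcmpOGT(NaN, 1.0) && fcmpUGT(NaN, 1.0));
}
```

Every unordered code is the negation of an ordered one, so an inverted
result is enough to cover them.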
This reverts commit 9b2439167d9f794e317fecbdbb0a6e96f9ea4b56.
This was an unrelated NFC change to make a test more useful (really it should
have been first, it was supposed to show the test diff).
The X86FixupLEAs pass drops blockaddress offsets, when splitting up slow
3-ops LEAs, as can be seen in this example:
https://godbolt.org/z/bEsc3Poje
Before running the pass, the first instruction in bb.0 is a LEA with
ebp, ebx and a blockaddress.
After the transformation, the blockaddress is missing.
This happens because the 3-ops LEA is split up into a 2-ops LEA + an add
instruction.
However, as hasLEAOffset does not take blockaddresses into
consideration, the add is not emitted, which leads to the offset being
dropped.
Taking blockaddresses into consideration fixes this issue and results in
the add instruction being emitted.
This fixes #71667
When generating the assembly code for AIX/XCOFF, the .file pseudo-op
needs to be emitted first, before any csects are generated. Otherwise,
information such as the embedded command line will be associated with
part of the object file rather than the entire object file.
We can match this directly in isel with the i32 type being legal.
The generic DAG combine will unpromote part of the pattern and
prevent it from being matched in isel.
The WebKit Calling Convention was created specifically for the WebKit
FTL. FTL doesn't use LLVM anymore and therefore this calling convention
is obsolete.
This commit removes the WebKit CC, its associated tests, and
documentation.
Extend our PRE logic to cover non-immediate AVL values. This covers
large constant AVLs (which must be materialized in registers), and may
help some code written explicitly with intrinsics.
Looking at the existing code, I can't entirely figure out why I thought
we needed VL == AVL to perform the PRE. My best guess is that I was
worried about the VLMAX < VL < 2 * VLMAX case, but the spec explicitly
says that vsetvli must be deterministic on any particular AVL value.
That case was, possibly by accident, covering another legality
precondition. Specifically, by only returning true for immediate and
VLMAX AVL values, we didn't encounter the case where the AVL was a
register and that register wasn't available in the predecessor (e.g. if
AVL is a load in the MBB block itself).
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
1. Map R16-R31 to DWARF registers 130-145.
2. Make R16-R31 caller-saved registers.
3. Make R16-R31 allocatable only when feature EGPR is supported.
4. Make R16-R31 available for instructions in legacy maps 0/1 and EVEX
space, except XSAVE*/XRSTOR.
RFC:
https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4
Explanations for some seemingly unrelated changes:
inline-asm-registers.mir, statepoint-invoke-ra-enter-at-end.mir:
The immediate (TargetInstrInfo.cpp:1612) used for the regdef/reguse is
the encoding for the register class in the enum generated by tablegen.
This encoding will change any time a new register class is added. Since
the number is part of the input, this means it can become stale.
seh-directive-errors.s:
R16-R31 makes ".seh_pushreg 17" legal
musttail-varargs.ll:
It seems some LLVM passes use the number of registers rather than the
number of allocatable registers as a heuristic.
This PR is to reland #67702 after #70222 in order to reduce some
compile-time regression when EGPR is not used.
This patch enhances the optimization of memcmp calls when only two
outcomes are needed and the comparison fits into one block, for example:
    bool result = memcmp(a, b, 6) > 0;
Previously, LLVM would generate unnecessary operations even when the
user of memcmp was only interested in a binary outcome.
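To show why a binary outcome can be cheaper, here is a hedged C++ sketch
of one way to answer `memcmp(a, b, 6) > 0` without computing a three-way
result. It is not necessarily the code this patch generates. The helper
names loadBE48 and memcmp6IsPositive are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack the first 6 bytes as a big-endian integer so that unsigned
// integer order matches lexicographic byte order.
static uint64_t loadBE48(const unsigned char *P) {
  uint64_t V = 0;
  for (int I = 0; I < 6; ++I)
    V = (V << 8) | P[I];
  return V;
}

// Equivalent to memcmp(A, B, 6) > 0, using a single unsigned compare.
bool memcmp6IsPositive(const void *A, const void *B) {
  return loadBE48(static_cast<const unsigned char *>(A)) >
         loadBE48(static_cast<const unsigned char *>(B));
}

// Cross-check against the library memcmp on one pair of buffers.
void checkAgainstMemcmp() {
  const unsigned char X[6] = {1, 2, 3, 4, 5, 6};
  const unsigned char Y[6] = {1, 2, 3, 4, 5, 5};
  assert(memcmp6IsPositive(X, Y) == (std::memcmp(X, Y, 6) > 0));
}
```

Only one comparison is needed because the caller never distinguishes
"less than" from "equal".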
Function parameters marked with inreg are supposed to be allocated to
SGPRs. However, for compute functions, this is ignored and function
parameters are allocated to VGPRs. This fix modifies CC_AMDGPU_Func in
AMDGPUCallingConv.td to use SGPRs if input arg is marked inreg.
---------
Co-authored-by: Jun Wang <jun.wang7@amd.com>
We can assume that the minimal SVE register is 128-bit when NEON is not
available, and we can lower the shuffle operation with one operand to a
TBL1 SVE instruction.
- Revert "[DAGCombiner] Transform `(icmp eq/ne (and X,C0),(shift X,C1))`
to use rotate or to getter constants." - causes a miscompile, see
112e49b381 (commitcomment-131943923)
- Revert "[X86] Fix gcc warning about mix of enumeral and non-enumeral
types. NFC", which fixes a compiler warning in the commit above
Track the live register state immediately before, instead of after,
MBBI. This makes it simple to track the state at the start or end of a
basic block without a separate (and poorly named) Tracking flag.
This changes the API of the backward(MachineBasicBlock::iterator I)
method, which now recedes to the state just before, instead of just
after, *I. Some clients are simplified by this change.
There is one small functional change shown in the lit tests where
multiple spilled registers all need to be reloaded before the same
instruction. The reloads will now be inserted in the opposite order.
This should not affect correctness.