In our default SelectionDAG where i32 isn't legal, the zext will become
and i64 AND and often get optimized out on its own. With i32 legal, we
need to turn it in into sext.w and rely on RISCVOptWInstrs to remove it.
Coerce the register bank based on the users of the G_LOAD or the
defining instruction for the G_STORE.
s64 on rv32 is handled by forcing the FPRB register bank.
Most (x86) swiftasync functions tend to use both SelectionDAGISel and
FastISel lowering:
* FastISel argument lowering can only handle C calling convention.
* FastISel fails mid-BB in a number of ways, including in simple `ret
void` instructions under certain circumstances.
This dance of SelectionDAG (argument) -> FastISel (some instructions) ->
SelectionDAG(remaining instructions) is lossy; in particular, Argument
information lowering is cleared after that first SelectionDAG run.
Since swiftasync functions rely heavily on proper Argument lowering for
debug information, this patch disables the use of FastISel in such
functions.
PR #71614 identified an issue in the lowering of v1f16 vector compares,
where the `v1i1 setcc` is expanded to `v1i16 setcc`, and the `v1i16
setcc` tries to be expanded to a `v2i16 setcc` which fails. For floating
point types we can let them scalarize instead though, generating a
`setcc f16` that can be lowered using normal fp16 lowering.
07a8ff4892b2a54f0bd5843f863bcffa7a258f1f added a special case combine
for v1 vselect to expand the predicate type to the same size as the fcmp
operands. This turns that off for float types, allowing them to
scalarize naturally, which hopefully fixes the issue by preventing the
v1i16 setcc, meaning it wont try to widen to larger vectors.
The codegen might not be optimal, but as far as I can tell everything
generated successfully, providing that no `v1i16 setcc v1f16`
instructions get generated.
Extends the existing code that performed something similar for SUBV_BROADCAST_LOAD, but this is just for cases where AVX2 targets loads full width 128-bit constant vectors but broadcasts the equivalent 256-bit constant vector
Fixes AVX2 case for Issue #70947
The compiler currently abends with an impossible reg-to-reg copy when
producing a negative zero FP immediate on RV32 with -Zdinx. This is
because we emit a negation that uses FP registers. Emit the right node
to produce correct code.
SGPR spill VGPRs are WWM registers so allow them to be allocated by
SIPreAllocateWWMRegs pass.
This intentionally prevents spilling of these VGPRs when enabled.
Add subrange tracking and handling for LiveIntervals during PHI
elimination.
This requires extending MachineBasicBlock::SplitCriticalEdge to also
update subrange intervals.
Add support for bf16x2 instructions such as setp, fneg, fabs, etc;
Fix the instructions that were not differentiated between sm_80 and
sm_90 support, such as fpround etc.
Add more bf16 test cases to ensure the correct behavior.
---------
Co-authored-by: shenhan03 <shenhan03@kuaishou.com>
After performing sink-and-fold over a COPY, the original instruction is
replaced with one that produces its output in the destination of the
copy. Its value is still available (in a hard register), so if there are
debug instructions which refer to the (now deleted) virtual register
they could be updated to refer to the hard register, in principle.
However, it's not clear how to do that, moreover in some cases the debug
instructions may need to be replicated proportionally to the number of
the COPY instructions replaced and in some extreme cases we can end up
with quadratic increase in the number of debug instructions, e.g:
int f(int);
void g(int x) {
int y = x + 1;
int t0 = y;
f(t0);
int t1 = y;
f(t1);
}
This reverts commit 8ee07a4be7f7d8654ecf25e7ce0a680975649544.
The revert is breaking AMDGPU backend tests (which I didn't have
enabled), and I don't want to risk breakages over the weekend, so just
revert for now.
This reverts commit 82f68a992b9f89036042d57a5f6345cb2925b2c1.
cd7ba9f3d090afb5d3b15b0dcf379d15d1e11e33 needs to be reverted to fix
test failures on builds without assertions, and this one needs to be
reverted first for that.
Summary:
The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY
sections to call all the global constructors in a single kernel.
Previously this mistakenly used the same iteration logic for both
arrays. The destructors stored in FINI_ARRAY are stored in the same
order as
the ones in the INIT_ARRAY section so we need to traverse it in reverse
order.
Relanding after the revert in fe7b5e2cfcf6848287010291081f85fa1f6bb2ef
using the IR builder interface instead of ConstantExpr.
Summary:
The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY
sections to call all the global constructors in a single kernel.
Previously this mistakenly used the same iteration logic for both
arrays. The destructors stored in FINI_ARRAY are stored in the same
order as
the ones in the INIT_ARRAY section so we need to traverse it in reverse
order.
Summary:
This pass emits the new "nvptx$device$init" and "nvptx$device$fini"
kernels that are callable by the device. This intends to mimic the
method of lowering for AMDGPU where we emit `amdgcn.device.init` and
`amdgcn.device.fini` respectively. These kernels simply iterate a symbol
called `__init_array_start/stop` and `__fini_array_start/stop`.
Normally, the linker provides these symbols automatically. In the AMDGPU
case we only need call the kernel and we call the ctors / dtors.
However, for NVPTX we require the user initializes these variables to
the associated globals that we already emit as a part of this pass.
The motivation behind this change is to move away from OpenMP's handling
of ctors / dtors. I would much prefer that the backend / runtime handles
this. That allows us to handle ctors / dtors in a language agnostic way,
This approach requires that the runtime initializes the associated
globals. They are marked `weak` so we can emit this per-TU. The kernel
itself is `weak_odr` as it is copied exactly.
One downside is that any module containing these kernels elicitis the
"stack size cannot be statically determined warning" every time from
`nvlink` which is annoying but inconsequential for functionality. It
would be nice if there were a way to silence this warning however.
Similar to #70635, this expands the handling of integer to fp
conversions. The code is very similar to the float->integer conversions
with types handled oppositely. There are some extra unhandled cases
which require more handling for ASR operations.
This patch lowers `sdiv x, +/-2**k` to `add + select + shift` when the
short forward branch optimization is enabled. The latter inst seq
performs faster than the seq generated by target-independent
DAGCombiner. This algorithm is described in ***Hacker's Delight***.
This patch also removes duplicate logic in the X86 and AArch64 backend.
But we cannot do this for the PowerPC backend since it generates a
special instruction `addze`.
Fixed:
1. Maximum register pressure calculation at the instruction level.
Previously max RP included both def and use of registers of an
instruction. Now maximum RP includes _uses_ and _early-clobber defs_.
2. Uses were incorrectly tracked and this resulted in a mismatch of
live-in set reported by LiveIntervals and tracked live reg set when the
beginning of the block is reached.
Interface has changed, moveMaxPressure becomes deprecated and
getMaxPressure, resetMaxPressure functions are added. reset function
seem now more consistent.
si-wqm sometimes needs to save the LiveMask in the entry block. Later
on, while looking for a place to enter WQM/WWM, it unconditionally
skips over the first COPY instruction in the entry block. This is
incorrect for functions where the LiveMask doesn't need to be saved, and
therefore the first COPY is more likely a COPY from a function argument
and might need to be in some non-exact mode.
This patch fixes the issue by also checking that the source of the COPY
is the EXEC register.
This produces different code in 3 of the existing tests:
In wwm-reserved.ll, a SGPR copy is now inside the WWM area rather than
outside. This is benign.
In wave32.ll, we end up with an extra register copy. This is because
the first COPY in the block is now part of the WWM block, so
si-pre-allocate-wwm-regs will allocate a new register for its
destination (when it was outside of the WWM region, the register
allocator could just re-use the same register). We might be able to
improve this in si-pre-allocate-wwm-regs but I haven't looked into it.
The same thing happens in dual-source-blend-export.ll, but for that
one it's harder to see because of the scheduling changes. I've uploaded
the before/after si-wqm output for it here:
https://reviews.llvm.org/differential/diff/553445/
Differential Revision: https://reviews.llvm.org/D158841