The depths of the Root and the NewRoot are to be compared in
MachineCombiner::improvesCriticalPathLen(), and while the call to
BlockTrace.getInstrCycles(*Root) includes the Depth of a PHI, for some
reason PHI nodes have been ignored in getOperandDef().
This patch removes the special handling of PHIs in getOperandDef() so that
Root and NewRoot get a fair comparison. This does not affect loop headers
as MachineTraceMetrics handles that case by ignoring incoming PHI edges.
Perform the requested arithmetic and produce a carry output in addition
to the normal result.
Clang has them as builtins (__builtin_add_overflow_p). The middle end
has intrinsics for them (sadd_with_overflow).
AArch64: ADDS Add and set flags
On Neoverse V2, they run at half the throughput of basic arithmetic and
have a limited set of pipelines.
The changeset enables lowering of `llvm.masked.compressstore(%data,
%ptr, %mask)` for RVV for fixed vector type into:
```
%0 = vcompress %data, %mask, %vl
%new_vl = vcpop %mask, %vl
vse %0, %ptr, %1, %new_vl
```
Such lowering is only possible when `%data` fits into available LMULs
and otherwise `llvm.masked.compressstore` is scalarized by
`ScalarizeMaskedMemIntrin` pass.
Even though RVV spec in the section `15.8` provide alternative sequence
for compressstore, use of `vcompress + vcpop` should be a proper
canonical form to lower `llvm.masked.compressstore`. If RISC-V target
find the sequence from `15.8` better, peephole optimization can
transform `vcompress + vcpop` into that sequence.
There were two existing patterns:
`concat_vectors(trunc(x), trunc(y)) -> uzp1(x, y)`
`concat_vectors(assertzext(trunc(x)), assertzext(trunc(y))) -> uzp1(x,
y)`
Move them into a class and add the following `assertsext` pattern to it:
`concat_vectors(assertsext(trunc(x)), assertsext(trunc(y))) -> uzp1(x,
y)`
Add the following transform for v8i8 and v4i16 result types to help with
pattern matching:
`truncating uzp1(x, y) -> trunc(concat(x, y))`
And a pattern to go with it:
`trunc(concat_vectors(x, y)) -> uzp1 (x, y)`
Add another isel pattern for v8i8 and v4i16 result vector types, similar
to
the existing concat pattern, but with a trunc node in the begining:
`trunc(concat_vectors(assertext_trunc(x), assertext_trunc(y))) ->
xtn(uzp1(x, y))`
In preparation of adding a similar instruction for large code model on
AIX for 32-bit, rename the exisitng ADDItocL 64-instruction to ADDItocL8
to match the naming convention of other instructions with 32-bit and
64-bit variants.
This PR:
* adds support for G_SPLAT_VECTOR generic opcode that may be legally
generated instead of G_BUILD_VECTOR by previous passes of the translator
(see https://github.com/llvm/llvm-project/pull/80378 for the source of
breaking changes);
* improves deduction of types for opaque pointers.
This PR also fixes the following issues:
* if a function has ptr argument(s), two functions that have different
SPIR-V type definitions may get identical LLVM function types and break
agreements of global register and duplicate checker;
* checks for pointer types do not account for TypedPointerType.
Update of tests:
* A test case is added to cover the issue with function ptr parameters.
* The first case, that is support for G_SPLAT_VECTOR generic opcode, is
covered by existing test cases.
* Multiple additional checks by `spirv-val` is added to cover more
possibilities of generation of invalid code.
This commit aims to support BPF arena kernel side
[feature](https://lore.kernel.org/bpf/20240209040608.98927-1-alexei.starovoitov@gmail.com/):
- arena is a memory region accessible from both BPF program and
userspace;
- base pointers for this memory region differ between kernel and user
spaces;
- `dst_reg = addr_space_cast(src_reg, dst_addr_space, src_addr_space)`
translates src_reg, a pointer in src_addr_space to dst_reg, equivalent
pointer in dst_addr_space, {src,dst}_addr_space are immediate constants;
- number 0 is assigned to kernel address space;
- number 1 is assigned to user address space.
On the LLVM side, the goal is to make load and store operations on arena
pointers "transparent" for BPF programs:
- assume that pointers with non-zero address space are pointers to
arena memory;
- assume that arena is identified by address space number;
- assume that address space zero corresponds to kernel address space;
- assume that every BPF-side load or store from arena is done via
pointer in user address space, thus convert base pointers using
`addr_space_cast(src_reg, 0, 1)`;
Only load, store, cmpxchg and atomicrmw IR instructions are handled by
this transformation.
For example, the following C code:
```c
#define __as __attribute__((address_space(1)))
void copy(int __as *from, int __as *to) { *to = *from; }
```
Compiled to the following IR:
```llvm
define void @copy(ptr addrspace(1) %from, ptr addrspace(1) %to) {
entry:
%0 = load i32, ptr addrspace(1) %from, align 4
store i32 %0, ptr addrspace(1) %to, align 4
ret void
}
```
Is transformed to:
```llvm
%to2 = addrspacecast ptr addrspace(1) %to to ptr ;; !
%from1 = addrspacecast ptr addrspace(1) %from to ptr ;; !
%0 = load i32, ptr %from1, align 4, !tbaa !3
store i32 %0, ptr %to2, align 4, !tbaa !3
ret void
```
And compiled as:
```asm
r2 = addr_space_cast(r2, 0, 1)
r1 = addr_space_cast(r1, 0, 1)
r1 = *(u32 *)(r1 + 0)
*(u32 *)(r2 + 0) = r1
exit
```
Co-authored-by: Eduard Zingerman <eddyz87@gmail.com>
Implement an abstraction to specify precise overload types supported by
DXIL ops. These overload types are typically a subset of LLVM
intrinsics.
Implement the corresponding changes in DXILEmitter backend.
Add tests to verify expected errors for unsupported overload types at
code generation time.
Add tests to check for correct overload error output.
Handles the INSERT_SUBVECTOR(INSERT_SUBVECTOR(UNDEF,SUB0,0),SUB1,N) pattern
Currently limited to v8i64/v8f64 cases as only AVX512 has decent cross lane 2-input shuffles, the plan is to relax this as I deal with some regressions
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.
---------
Co-authored-by: Jun Wang <jun.wang7@amd.com>
When we are using the Zcmp extension together with the E extension in
32-bit mode and we need to spill both callee-saved registers as well as
needing a couple of 32-bit stack slots, we emit a meaningless stack
adjustment with cm.push/cm.popret. Furthermore this leads to the stack
slot for the ra being clobbered so control returns to a random location.
This is just a pre-commit test so that the PR for the fix shows the
difference in code generation.
The DIV32/64 throughput was improved since Goldmont in the Atom
architecture. The Alder Lake-E shows similar number too. So we shouldn't
add such tunings to Gracemont and later products.
Checked from Agner Fog's table and uops.info.
With special case with Add constant is 0.
Original message:
We can support these by changing the sext promotion to -zext(-C) and
replacing a sgt check with ugt. Reframing the logic in terms of how the
unsigned range are affected. More comments in the patch.
The new cases check isLegalAddImmediate to avoid some
regressions in lit tests.
G_INSERT and G_EXTRACT are not sufficient to use to represent both
INSERT/EXTRACT on a subregister and INSERT/EXTRACT on a vector.
We would like to be able to INSERT/EXTRACT on vectors in cases that
INSERT/EXTRACT on vector subregisters are not sufficient, so we add
these opcodes.
I tried to do a patch where we treated G_EXTRACT as both
G_EXTRACT_SUBVECTOR and G_EXTRACT_SUBREG, but ran into an infinite loop
at this
[point](8b5b294ec2/llvm/lib/Target/RISCV/RISCVISelLowering.cpp (L9932))
in the SDAG equivalent code.
Re-land 634b0243b8f7acc85af4f16b70e91d86ded4dc83.
T1 allow for an optional registers list,
the register list must be {d0-d15}.
T2 define a mandatory register list,
the register list must be {d0-d31}.
The requirements for T1/T2 are as follows:
T1 T2
Require: v8-M.Main, v8.1-M.Main,
secure state secure state
16 D Regs valid valid
32 D Regs UNDEFINED valid
No D Regs NOP NOP
If all variables in the module are absolute, this means we're running
the pass again on an already lowered module, and that works.
If none of them are absolute, lowering can proceed as usual.
Only diagnose cases where we have a mix of absolute/non-absolute GVs,
which means we added LDS GVs after lowering, which is broken.
See #81491
Split from #75333
Extend repairIntervalsInRange to completely recompute the interva for a
register if subregister defs exist without precise subrange matches
(LaneMask exactly matching subregister).
This occurs when register sequences are lowered to copies such that the
size of the copies do not match any uses of the subregisters formed
(i.e. during twoaddressinstruction).
The subranges without this change are probably legal, but do not match
those generated by live interval computation. This creates problems with
other code that assumes subranges precisely cover all subregisters
defined, e.g. shrinkToUses().
Small mergeable read only data was place on the sdata before, but it
also means it lose the mergeable property, which means lose some code
size optimization opportunity during link time.
On converting an instruction to an early-clobber definition in
convertToThreeAddress, we must also update live intervals for the
register to start at the early-clobber index.