If we have a minimum vlen, we were adjusting StackSize to change the
unit from vscale to bytes, and then calculating the required padding
size for alignment in bytes. However, we then used that padding size as
an offset in vscale units, resulting in misplaced stack objects.
While it would be possible to adjust the object offsets by dividing
AlignmentPadding by ST.getRealMinVLen() / RISCV::RVVBitsPerBlock, we can
simplify the calculation a bit if instead we adjust the alignment to be
in vscale units.
@topperc This fixes a bug I am seeing after #110312, but I am not 100%
certain I am understanding the code correctly, could you please see if
this makes sense to you?
Fix bug that `mir-strip-debug` pass does not remove debug location from
bundled instructions.
Problem arises during testing that debug info does not affect
optimization passes output (`llvm-lit` with ` -Dllc="llc
-debugify-and-strip-all-safe"`), when pass operates on MIR with bundled
instructions + memory operands.
Let mir test check looks like:
```
CHECK-NEXT: BUNDLE {
CHECK-NEXT: $r3 = LD $r1, $r2 :: (load (s64) from %ir.a, !tbaa !2)
CHECK-NEXT: }
```
So as `mir-strip-debug` pass does not process bundled instructions,
running `llc -debugify-and-strip-all-safe` on the test will produce the
following output:
```
BUNDLE {
$r3 = LD $r1, $r2, debug-location !DILocation(line: 3, column: 1, scope: <0x608cb2b99b10>) :: (load (s64) from %ir.a, !tbaa !2)
}
```
And test will fail, but it shouldn't.
Seems like the root cause is that `mir-strip-debug` pass should remove
debug location from bundled instructions.
fixes#112974
partially fixes#70103
### Changes
- Added new tablegen based way of lowering dx intrinsics to DXIL ops.
- Added int_dx_group_memory_barrier_with_group_sync intrinsic in
IntrinsicsDirectX.td
- Added expansion for int_dx_group_memory_barrier_with_group_sync in
DXILIntrinsicExpansion.cpp`
- Added DXIL backend test case
### Related PRs
* [[clang][HLSL] Add GroupMemoryBarrierWithGroupSync intrinsic
#111883](https://github.com/llvm/llvm-project/pull/111883)
* [[SPIRV] Add GroupMemoryBarrierWithGroupSync intrinsic
#111888](https://github.com/llvm/llvm-project/pull/111888)
Restricts hlsl countbits to always return a uint32.
Implements a lowering from llvm.ctpop which has an overloaded return
type to dxil cbits op which always returns uint32.
Closes#112779
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
If an undef subregister def is live into another block, we need to
maintain a physreg def to track the liveness of those lanes. This
would manifest a verifier error after branch folding, when the cloned
tail block use no longer had a def.
We need to detect interference with other assigned intervals to avoid
clobbering the undef lanes defined in other intervals, since the undef
def didn't count as interference. This is pretty ugly and adds a new
dependency on LiveRegMatrix, keeping it live for one more pass. It also
adds a lot of implicit operand spam (we really should have a better
representation for this).
There is a missing verifier check for this situation. Added an xfailed
test that demonstrates this. We may also be able to revert the changes
in 47d3cbcf842a036c20c3f1c74255cdfc213f41c2.
It might be better to insert an IMPLICIT_DEF before the instruction
rather than using the implicit-def operand.
Fixes#98474
Now that LLVM 19.1.0 has been out for a while with post-vector-RA
vsetvli insertion enabled by default, this proposes to remove the flag
that restores the old pre-RA behaviour so we only have one configuration
going forward.
That flag was mainly meant as a fallback in case users ran into issues,
but I haven't seen anything reported so far.
With -fomit-frame-pointer, even if we set up a frame pointer for other
reasons (e.g. variable-sized or over-aligned stack allocations), we
don't need to create an ABI-compliant frame record. This means that we
can save all of the general-purpose registers in one push, instead of
splitting it to ensure that the frame pointer and link register are
adjacent on the stack, saving two instructions per function.
As part of FEAT_PAuthLR, a new DWARF Frame Instruction was introduced,
`DW_CFA_AARCH64_negate_ra_state_with_pc`. This instructs Libunwind that
the PC has been used with the signing instruction. This change includes
three commits
- Libunwind support for the newly introduced DWARF Instruction
- CodeGen Support for the DWARF Instructions
- Reversing the changes made in #96377. Due to
`DW_CFA_AARCH64_negate_ra_state_with_pc`'s requirements to be placed
immediately after the signing instruction, this would mean the CFI
Instruction location was not consistent with the generated location when
not using FEAT_PAuthLR. The commit reverses the changes and makes the
location consistent across the different branch protection options.
While this does have a code size effect, this is a negligible one.
For the ABI information, see here:
853286c7ab/aadwarf64/aadwarf64.rst (id23)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.
This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.
The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.
Part of SWDEV-455845.
This was introduced in the now-ratified RVA23 profile (and also added to
the RVA22 text) as a simple way of referring to H plus the set of
supervisor extensions required by RVA23.
https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc
This patch simply defines the extension. The next patch will adjust the
RVA23 profile to use it, and at that point I think we will be ready to
mark RVA23 as non-experimental.
Note that I haven't made it so if you enable all extensions that
constitute Sha, Sha is implied. Per #76893 (adding 'B'), the concern is
making this implication might break older external assemblers. Perhaps
this is less of a concern given the relative frequency of
`-march=${foo}_zba_zbb_zbs` vs the collection of H extensions. If we did
want to add that implication, we'd probably want to add it in a separate
patch so it can be easily reverted if found to cause problems.
Some tests contain errors in constrained intrinsic usage, such as missed
or extra type parameters, wrong type parameters order and some other.
---------
Co-authored-by: Andy Kaylor <andy_kaylor@yahoo.com>
Credits: https://github.com/llvm/llvm-project/pull/72976
LLVM ERROR: cannot select: %3:zpr(<vscale x 2 x s64>) = G_MUL %0:fpr,
%1:fpr (in function: xmulnxv2i64)
;; mul
define void @xmulnxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b,
ptr %p) {
entry:
%c = mul <vscale x 2 x i64> %a, %b
store <vscale x 2 x i64> %c, ptr %p, align 16
ret void
}
define void @mulnxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
ptr %p) {
entry:
%c = mul <vscale x 4 x i32> %a, %b
store <vscale x 4 x i32> %c, ptr %p, align 16
ret void
}
define void @mulnxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b,
ptr %p) {
entry:
%c = mul <vscale x 8 x i16> %a, %b
store <vscale x 8 x i16> %c, ptr %p, align 16
ret void
}
define void @mulnxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b,
ptr %p) {
entry:
%c = mul <vscale x 16 x i8> %a, %b
store <vscale x 16 x i8> %c, ptr %p, align 16
ret void
}
This patch makes it so that specifying all or none for -pgo-analysis-map
along with an explicit option causes an error as this set of options
does not really make sense.
This patch adds none and all options to the -pgo-analysis-map flag,
which do basically what they say on the tin. The none option is added to
enable forcing the pgo-analysis-map by overriding an earlier invocation
of the flag. The all option is just added for convenience.
Both are based on MachineLICMBase, and the functionality there is
"switched" based on a PreRegAlloc flag. This commit is simply about
trusting the original value of that flag, defined by the `MachineLICM`
and `EarlyMachineLICM` classes.
The `PreRegAlloc` flag used to be overwritten it based on MRI.isSSA(),
which is un-reliable due to how it is inferred by the MIRParser. I see
that we can now define isSSA in MIR (thanks @gargaroff ), meaning the
fix isn’t really needed anymore, but redefining that flag still feels
wrong.
Note that I'm looking into upstreaming more changes to MachineLICM, see
[the discourse
thread](https://discourse.llvm.org/t/extending-post-regalloc-machinelicm/82725).
We don't need to copy byval arguments to tail calls via a temporary, if
we can prove that we are not copying from the outgoing argument area.
This patch does this when the source if the argument is one of:
* Memory in the local stack frame, which can't be used for tail-call
arguments.
* A global variable.
We can also avoid doing the copy completely if the source and
destination are the same memory location, which is the case when the
caller and callee have the same signature, and pass some arguments
through unmodified.
When passing byval arguments to tail-calls, we need to store them into
the stack memory in which this the caller received it's arguments. If
any of the outgoing arguments are forwarded from incoming byval
arguments, then the source of the copy is from the same stack memory.
This can result in the copy corrupting a value which is still to be
read.
The fix is to first make a copy of the outgoing byval arguments in local
stack space, and then copy them to their final location. This fixes the
correctness issue, but results in extra copying, which could be
optimised.
Byval arguments which are passed partially in registers get stored into
the local stack frame, but it is valid to tail-call them because the
part which gets spilled is always re-loaded into registers before doing
the tail-call, so it's OK for the spill area to be deallocated.
The ARM backend was checking that the outgoing values for a tail-call
matched the incoming argument values of the caller. This isn't
necessary, because the caller can change the values in both registers
and the stack before doing the tail-call. The actual limitation is that
the callee can't need more stack space for it's arguments than the
caller does.
This is needed for code using the musttail attribute, as well as
enabling tail calls as an optimisation in more cases.
This commit has completed the Extension for Resumable Non Maskable
Interrupts, adding four CRSs and one Trap-Return instruction.
Specification link:["Smrnmi"
Extension](https://github.com/riscv/riscv-isa-manual/blob/main/src/rnmi.adoc)
---------
Co-authored-by: Sam Elliott <sam@lenary.co.uk>
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- `VecFuncs.def`: define intrinsic to sleef/armpl mapping
- `LegalizerHelper.cpp`: add missing fewerElementsVector handling for
the new atan2 intrinsic
- `AArch64ISelLowering.cpp`: Add arch64 specializations for lowering
like neon instructions
- `AArch64LegalizerInfo.cpp`: Legalize atan2.
Part 5 for Implement the atan2 HLSL Function #70096.
For image_atomic instructions with TFE, in some cases (e.g., when
dmask=3) the disassembler produces dst register with wrong size (e.g.,
image_atomic_smin v5, v1, s[8:15] dmask:0x3 tfe, instead of v[5:7]).
This patch fixes the VDataDwords values for image atomic instructions.