_TIED and _MASK_TIED pseudos have one fewer operand than other
pseudos, so we shouldn't attach the same number of SchedReads to these
instructions.
I don't think we have a way to (explicitly) check scheduling classes, so
I have only tested this patch with the existing tests.
If we have a minimum vlen, we were adjusting StackSize to change the
unit from vscale to bytes, and then calculating the required padding
size for alignment in bytes. However, we then used that padding size as
an offset in vscale units, resulting in misplaced stack objects.
While it would be possible to adjust the object offsets by dividing
AlignmentPadding by ST.getRealMinVLen() / RISCV::RVVBitsPerBlock, we can
simplify the calculation a bit if instead we adjust the alignment to be
in vscale units.
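To make the unit mismatch concrete, here is a small Python sketch of the arithmetic. It is an illustration under assumed numbers, not the code in the frame lowering: the function names and the sample values (a scale factor of 2, i.e. a minimum VLEN of 128 divided by the 64-bit RVV block size, a scalable stack size of 24, and a 32-byte alignment) are hypothetical.

```python
def padding_bytes(size_scalable, align_bytes, scale):
    """Padding computed after converting the size to bytes (the old approach).

    `scale` stands for ST.getRealMinVLen() / RISCV::RVVBitsPerBlock.
    """
    size_bytes = size_scalable * scale
    return (-size_bytes) % align_bytes


def padding_scalable(size_scalable, align_bytes, scale):
    """Padding computed with the alignment converted to vscale units instead.

    Assumption for this sketch: align_bytes is a multiple of scale.
    """
    align_scalable = max(1, align_bytes // scale)
    return (-size_scalable) % align_scalable


# Hypothetical numbers: scalable size 24, 32-byte alignment, scale factor 2.
pad_b = padding_bytes(24, 32, 2)     # 16, measured in bytes
pad_s = padding_scalable(24, 32, 2)  # 8, measured in vscale units
```

With these numbers the byte-based padding is 16, but applying 16 as an offset in vscale units would move objects by 32 bytes at the minimum VLEN. The vscale-unit padding of 8 corresponds to the intended 16 bytes.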
@topperc This fixes a bug I am seeing after #110312, but I am not 100%
certain I am understanding the code correctly, could you please see if
this makes sense to you?
fixes #112974
partially fixes #70103
### Changes
- Added a new TableGen-based way of lowering dx intrinsics to DXIL ops.
- Added int_dx_group_memory_barrier_with_group_sync intrinsic in
IntrinsicsDirectX.td
- Added expansion for int_dx_group_memory_barrier_with_group_sync in
`DXILIntrinsicExpansion.cpp`
- Added DXIL backend test case
### Related PRs
* [[clang][HLSL] Add GroupMemoryBarrierWithGroupSync intrinsic
#111883](https://github.com/llvm/llvm-project/pull/111883)
* [[SPIRV] Add GroupMemoryBarrierWithGroupSync intrinsic
#111888](https://github.com/llvm/llvm-project/pull/111888)
Restricts HLSL countbits to always return a uint32.
Implements a lowering from llvm.ctpop, which has an overloaded return
type, to the DXIL cbits op, which always returns uint32.
Closes #112779
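As a rough model of the countbits change, the Python sketch below counts set bits for an operand of any width and then forces the result into 32 bits. The helper names are hypothetical and this is not the DXIL lowering itself; it only illustrates that a population count of an N-bit value is at most N, so it always fits in a uint32.

```python
def ctpop(value, bit_width):
    """Population count of `value` interpreted as a `bit_width`-bit integer."""
    mask = (1 << bit_width) - 1
    return bin(value & mask).count("1")


def countbits_u32(value, bit_width):
    """Model of a countbits that always yields a 32-bit unsigned result."""
    return ctpop(value, bit_width) & 0xFFFFFFFF


narrow = countbits_u32(0xFFFF, 16)       # 16-bit operand
wide = countbits_u32((1 << 64) - 1, 64)  # 64-bit operand, result still fits in 32 bits
```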
This patch adds assembly/disassembly for the following instructions:
FMMLA (widening, FP16 to FP32)
FMMLA (widening, FP8 to FP16)
FMMLA (widening, FP8 to FP32)
According to [1].
[1] https://developer.arm.com/documentation/ddi0602
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
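For readers unfamiliar with the shape of the result, here is a minimal Python model of the intrinsic's semantics. It is only a sketch: the `SinCos` type and both function names are made up for illustration, and it says nothing about how the legalization is implemented.

```python
import math
from typing import NamedTuple


class SinCos(NamedTuple):
    """Stand-in for the two-element struct returned by the intrinsic."""
    sin: float
    cos: float


def sincos(x: float) -> SinCos:
    """Scalar form: one input, both results returned together."""
    return SinCos(math.sin(x), math.cos(x))


def sincos_vector(xs):
    """Vector form: a pair of vectors rather than a vector of pairs."""
    pairs = [sincos(x) for x in xs]
    return [p.sin for p in pairs], [p.cos for p in pairs]


sines, cosines = sincos_vector([0.0, math.pi / 2])
```

The vector variant mirrors the `{ <4 x float>, <4 x float> }` declaration above: the struct holds one vector of sines and one vector of cosines.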
I received a report that the outliner drops memoperands and causes this
code to crash. Handle this by only copying the memoperand if it exists.
Similarly for expandRV32ZdinxLoad.
This patch aims to reduce the includes used by AArch64ISelLowering, allowing it
to be included by unittests so that they can reference the AArch64ISD nodes.
It:
- Moves the inclusion of AArch64SMEAttributes.h to the uses.
- Moves LowerPtrAuthGlobalAddressStatically to a static function, so that
AArch64PACKey is not required in the header.
- Moves the definitions of getExceptionPointerRegister to the cpp file, to
remove the reference to AArch64::X0.
This only changes `llvm/lib/Target/AMDGPU/SIISelLowering.cpp`.
There are five uses of `std::tie` remaining because they can't be
replaced with C++17 structured bindings.
Split the scheduling classes of VMADC/VMSBC away from those of VADC/VSBC,
because the former are technically mask-producing instructions rather
than normal vector arithmetic and might have different performance
characteristics on some processors.
This is effectively NFC.
This patch adds the following instructions:
Conversion between floating-point and integer:
FCVT{AS, AU, MS, MU, NS, NU, PS, PU, ZS, ZU}
{S,U}CVTF
Advanced SIMD three-register extension:
FMMLA
According to https://developer.arm.com/documentation/ddi0602
Co-authored-by: Marian Lukac <marian.lukac@arm.com>
Co-authored-by: Spencer Abson <spencer.abson@arm.com>
This patch adds the appropriate hookups in X86PfmCounters.td for
SapphireRapids. This is mostly to fix errors when some of my jobs that
only really need dummy counters get scheduled on Sapphire Rapids
machines, but I figured I might as well do it properly while here. I do
not have hardware access to test this currently, but this matches
exactly with what is in the libpfm source code.
This is defined by the `-aarch64-streaming-hazard-size` option or its
alias `-aarch64-stack-hazard-size` (the original name). It has been
renamed to be more general as this option will (for the time being) be
used to detect if the current target has streaming mode memory hazards.
---------
Co-authored-by: Hari Limaye <hari.limaye@arm.com>
Now that LLVM 19.1.0 has been out for a while with post-vector-RA
vsetvli insertion enabled by default, this proposes to remove the flag
that restores the old pre-RA behaviour so we only have one configuration
going forward.
That flag was mainly meant as a fallback in case users ran into issues,
but I haven't seen anything reported so far.
This patch adds assembly/disassembly for the following SVE2.2
instructions:
- SQABS (zeroing)
- SQNEG (zeroing)
- URECPE (zeroing)
- USQRTE (zeroing)
- Refactor the existing merging forms to remove the now redundant bit 17
argument.
- In accordance with:
https://developer.arm.com/documentation/ddi0602/latest/
For disassembler tables we use *V1_V4* variants for VIMAGE and then
remove unused vaddr fields. The *V1_V1* variant, which has every vaddr
field other than vaddr0 set to 0, was also enabled and caused confusion
when decoding cases which used v0 (whose encoded value is 0).
With -fomit-frame-pointer, even if we set up a frame pointer for other
reasons (e.g. variable-sized or over-aligned stack allocations), we
don't need to create an ABI-compliant frame record. This means that we
can save all of the general-purpose registers in one push, instead of
splitting it to ensure that the frame pointer and link register are
adjacent on the stack, saving two instructions per function.
As part of FEAT_PAuthLR, a new DWARF Frame Instruction was introduced,
`DW_CFA_AARCH64_negate_ra_state_with_pc`. This instructs Libunwind that
the PC has been used with the signing instruction. This change includes
three commits:
- Libunwind support for the newly introduced DWARF Instruction
- CodeGen Support for the DWARF Instructions
- Reversing the changes made in #96377. Because
`DW_CFA_AARCH64_negate_ra_state_with_pc` must be placed immediately
after the signing instruction, the CFI instruction location would
otherwise be inconsistent with the location generated when not using
FEAT_PAuthLR. The commit reverses the changes and makes the location
consistent across the different branch protection options. While this
does have a code size effect, it is a negligible one.
For the ABI information, see here:
853286c7ab/aadwarf64/aadwarf64.rst (id23)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.
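The effect of the wider lowering type can be pictured with the Python sketch below. It models only the resulting loop shape, several load/store pairs per iteration plus a residual copy, and not the actual mechanism (widening the type and letting legalization split it). The function name and the default `access_bytes` and `unroll` values are hypothetical and are not the factor chosen by the patch.

```python
def copy_unrolled(dst, src, length, access_bytes=16, unroll=4):
    """Copy `length` bytes using `unroll` load/store pairs per loop iteration.

    Returns the number of main-loop iterations executed.
    """
    step = access_bytes * unroll       # bytes copied per iteration
    main_end = length - length % step  # end of the region the loop covers
    iterations = 0
    for base in range(0, main_end, step):
        for k in range(unroll):        # the unrolled load/store pairs
            lo = base + k * access_bytes
            dst[lo:lo + access_bytes] = src[lo:lo + access_bytes]
        iterations += 1
    # Bytes that do not fill a whole iteration are copied separately.
    dst[main_end:length] = src[main_end:length]
    return iterations


source = bytes(i % 251 for i in range(1030))
target = bytearray(len(source))
loop_count = copy_unrolled(target, source, len(source))
```

With the assumed 64 bytes per iteration, a 1030-byte copy runs 16 iterations and leaves 6 residual bytes, where a one-pair-per-iteration loop with 16-byte accesses would need 64 iterations.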
This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.
The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.
Part of SWDEV-455845.
This was introduced in the now-ratified RVA23 profile (and also added to
the RVA22 text) as a simple way of referring to H plus the set of
supervisor extensions required by RVA23.
https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc
This patch simply defines the extension. The next patch will adjust the
RVA23 profile to use it, and at that point I think we will be ready to
mark RVA23 as non-experimental.
Note that I haven't made it so that enabling all of the extensions that
constitute Sha implies Sha. Per #76893 (adding 'B'), the concern is that
making this implication might break older external assemblers. Perhaps
this is less of a concern given the relative frequency of
`-march=${foo}_zba_zbb_zbs` vs the collection of H extensions. If we did
want to add that implication, we'd probably want to add it in a separate
patch so it can be easily reverted if found to cause problems.
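The one-way relationship described above can be sketched in Python as follows. This assumes that enabling Sha expands to its constituent extensions; the component names other than `h` are hypothetical placeholders, not the real list from the profile, and `expand` is an illustrative helper rather than anything in LLVM.

```python
# Placeholder component set: "h" plus stand-ins for the supervisor extensions.
SHA_COMPONENTS = frozenset({"h", "ss_placeholder_a", "ss_placeholder_b"})


def expand(enabled):
    """Expand Sha into its components, but never infer Sha from them."""
    result = set(enabled)
    if "sha" in result:
        result |= SHA_COMPONENTS
    return result


from_sha = expand({"sha"})             # contains every component
from_parts = expand(SHA_COMPONENTS)    # does not gain "sha"
```

Adding the reverse implication would amount to one extra check in `expand`, which is why it is easy to keep in a separate, revertible patch.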