Adds intrinsics for the following SME2 instructions:
* BFVDOT (32-bit)
* FVDOT (32-bit)
* SVDOT (2-way) (32-bit)
* SVDOT (4-way) (32-bit and 64-bit)
* UVDOT (2-way) (32-bit)
* UVDOT (4-way) (32-bit and 64-bit)
* SUVDOT (32-bit)
* USVDOT (32-bit)
NOTE: These intrinsics are still in development and are subject to future changes.
Differential Revision: https://reviews.llvm.org/D142000
Adds (single, multi & indexed) intrinsics for the following:
- bfmlal/bfmlsl
- fmlal/fmlsl
- smlal/smlsl
- umlal/umlsl
This patch also extends SelectSMETileSlice to handle scaled vector select offsets.
NOTE: These intrinsics are still in development and are subject to future changes.
Reviewed By: CarolineConcatto
Differential Revision: https://reviews.llvm.org/D142004
Adds intrinsics for the following:
- fmla (single, multi & indexed)
- fmls (single, multi & indexed)
NOTE: These intrinsics are still in development and are subject
to future changes.
Reviewed By: CarolineConcatto
Differential Revision: https://reviews.llvm.org/D141946
This is an alternative to D141074 to fix the problem by adjusting
the precision control dynamically.
This isn't quite complete yet. I want to support fadd with an load
folded into it too. That's the code we will usually generate.
Posting for early review so we can do some testing of this solution.
Differential Revision: https://reviews.llvm.org/D142178
This reverts commit a6e3027db7ebe6863e44bafcfeaacc16bdc88a3f.
Chrome and Halide are both reporting issues with importing builtins.
Maybe the better direction is to manually adjust FPCW for the inline
sequence on Windows.
Extension of https://reviews.llvm.org/D141101 to even further reduce the amount of implicit operands we attach. The main benefit is to improve cability of post-ra scheduler, and reduce unneeded dependency resolution (e.g. inserting snops).
Unfortunately, we run into regressions if we completely minimize the amount implicit operands (naively), we run into some regressions (e.g. dual_movs are replaced with multiple calls to v_mov). This is even more reason to switch to LiveRegUnits.
Nonetheless, this patch removes the operands which we can for free (more or less).
Change-Id: Ib4f409202b36bdbc59eed615bc2d19fa8bd8c057
Differential Revision: https://reviews.llvm.org/D141557
Change-Id: I8b039e3c0d39436b384083f8beb947ee1b1730b2
Linux kernel sets SCTRL_EL1.BT0 and BT1 to 1 unconditionally, which
makes PACIASP equivalent to BTI C + PACIA LR,SP.
Use the shorter instruction sequence by default.
I'm not aware of anyone who needs the opposite. They are welcome to
revert to the current behavior under a subtarget feature or an
environment check.
This reverts commit 571c8c5263a79293aaadae07b11feb36726eaf53.
Differential Revision: https://reviews.llvm.org/D141978
We were iterating over a SmallPtrSet when outputting slot variables.
This is still correct but made the test fail under reverse iteration.
This patch replaces the SmallPtrSet with a SmallVector.
Also remove the "Stack Frame Layout" lines from arm64-opt-remarks-lazy-bfi test,
since those also break under reverse iteration.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D142127
Current implementation abuses ErrorMargin to apply an additional
bias to VGPR and SGPR limits under a high register pressure. The
ErrorMargin exists to account for inaccuracies of the RP tracker
and not to tackle an excess pressure. Introduce separate bias for
this purpose and also make it different for SGPRs and VGPRs as we
may want to use different values in the future.
This is supposed to be NFC, however there is a subtle difference
when subtracting a margin overflows the limit. Doing two subtractions
makes it less probable, although manifests only in mir tests with
an artificially small register budget.
Differential Revision: https://reviews.llvm.org/D142051
shuffle_vector instructions are serialized targeting SVE fixed vectors, see
https://reviews.llvm.org/D139111. This patch disables
optimizeExtendOrTruncateConversion peepholes that generates shuffle_vector.
Differential Revision: https://reviews.llvm.org/D141439
Only allow replacements of nodes that have a single user. This is better as
simple instructions (e.g. XGRK) are one cycle faster, and it helps in cases
where both inputs share a common node.
Review: Ulrich Weigand
This change makes the AsmPrinter emit OpExecutionMode ContractionOff
when both opencl.enable.FP_CONTRACT and spirv.ExecutionMode
metadata are not present.
Differential Revision: https://reviews.llvm.org/D141734
Add {e,r}flags, {g,f}s.base registers so they can be referenced in cfi
directives,. They are not otherwise useable in any instructions,
but can be implicitly pushed to the stack like with pushf for
{e,r}flags.
Differential Revision: https://reviews.llvm.org/D141879
This transofrm loses information that can be useful for other transforms.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D141883
Adds intrinsics for the following instructions:
* WHILEGE (predicate pair)
* WHILEGT (predicate pair)
* WHILEHI (predicate pair)
* WHILEHS (predicate pair)
* WHILELE (predicate pair)
* WHILELO (predicate pair)
* WHILELS (predicate pair)
* WHILELT (predicate pair)
I've added an opcode selector called SelectOpcodeFromVT to
AArch64ISelDAGToDAG.cpp that we will extend in future to
select opcodes from different MVTs. For now, the only use is
for selecting predicate types.
NOTE: These intrinsics are still in development and are subject
to future changes.
Differential Revision: https://reviews.llvm.org/D141936
`-loongarch-numeric-reg` for llvm-mc and llc.
`-M numeric` (which matches GNU objdump) for llvm-objdump and llvm-mc.
Reviewed By: SixWeining
Differential Revision: https://reviews.llvm.org/D141743
The code below currently prints less accurate values only on Windows 32-bit. On Windows, the default precision control on x87 is only 53-bit, and FADD triggers rounding with that precision, so the final result may be less accurate. This revision avoids less accurate conversions by using library calls instead.
```
int main() {
int64_t n = 0b0000000000111111111111111111111111011111111111111111111111111111;
printf("%lld, %.0f, %.0f", n, (float)n, (float)(uint64_t)n);
return 0;
}
```
Reviewed By: craig.topper, lebedev.ri
Differential Revision: https://reviews.llvm.org/D141074
Issue #58168 describes the difficulty diagnosing stack size issues
identified by -Wframe-larger-than. For simple code, its easy to
understand the stack layout and where space is being allocated, but in
more complex programs, where code may be heavily inlined, unrolled, and
have duplicated code paths, it is no longer easy to manually inspect the
source program and understand where stack space can be attributed.
This patch implements a machine function pass that emits remarks with a
textual representation of stack slots, and also outputs any available
debug information to map source variables to those slots.
The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout`
to the compiler invocation. Like other remarks the diagnostic
information can be saved to a file in a machine readable format by
adding -fsave-optimzation-record.
Fixes: #58168
Reviewed By: nickdesaulniers, thegameg
Differential Revision: https://reviews.llvm.org/D135488
Add support for _FLT_ROUNDS_ in SystemZ.
Patch by Tulio Magno Quites Machado Filho.
Reviewed By: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D140988
As per the test case from Steven Johnson in https://reviews.llvm.org/rGf8d9097168b7#1165311
we can indeed encounter such shuffles, that produce all-undef after folding,
before something else manages to optimize them away.
On Darwin, function arguments occupy their real size when passed on the stack
(e.g. an i16 only consumes 2 bytes). This means that, even for fixed args in
varargs calls we need to keep track of the original type being passed before
any DAG/GISel promotions. Existing logic only applied this fix to the
non-varargs case leading to mismatch between caller & callee in those
situations.
On Linux & Windows these arguments always occupy a 64-bit slot anyway so
there's no special handling needed.
Origianl patch made a mistake that ugt is reverse cc should be ule.
And ule < C will be generalize to ult < C + 1. So the new patch add support for ult < Pow2 case.
https://alive2.llvm.org/ce/z/naBw5A
Reviewed By: samtebbs, chapuni
Differential Revision: https://reviews.llvm.org/D141829
Like in the scalar domain, combine calls to (fp_to_int (ftrunc X)) on
scalable and fixed-length vectors into a single vfcvt instruction.
For truncating rounds, the static vfcvt.rtz rounding mode is used.
Otherwise use the VFCVT_RM_ variants to set the rounding mode
dynamically.
Closes https://github.com/llvm/llvm-project/issues/56737
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D141599
This implements the fold (abs (sub nsw x, y)) -> abds(x, y). Providing
the sub is nsw this appears to be valid without the extensions that are
usually used for abds. https://alive2.llvm.org/ce/z/XHVaB3. The
equivalent abdu combine seems to not be valid.
Differential Revision: https://reviews.llvm.org/D141665
IR is now always parsed in opaque pointer mode, unless
-opaque-pointers=0 is explicitly given. There is no automatic
detection of typed pointers anymore.
The -opaque-pointers=0 option is added to any remaining IR tests
that haven't been migrated yet.
Differential Revision: https://reviews.llvm.org/D141912
If `getCoveringSubRegIndexes` returns a set of subregister indexes where some subregisters overlap others, it can create unsatisfiable copy bundles that eventually cause VirtRegRewriter to error out due to "cycles in copy bundle".
We can simply prevent this by making the algorithm skip over subregisters indexes that would cause an overlap with already-covered lanes.
Note that in the case of AMDGPU, this problem is caused by the lack of subregisters indexes for 13/14/15-register tuples. We have everything up until 12, then we have 16 and 32 but nothing between 12 and 16.
This means that the best candidate to do the least amount of copies when splitting a 29-register tuple was to copy (e.g.) 0-15 and 14-29, causing an overlap.
With this change, getCoveringSubRegIndexes will now prefer using something like 0-15, 16-28 and 1
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D141576
The REINTERPRET_CAST operation generates redundant and and ptrue instructions.
For some instructions, this is redundant, because its inactive lanes are zeroed by construction.
For example. Codegen before:
```
facgt p2.d, p0/z, z4.d, z1.d
ptrue p1.d
and p1.b, p2/z, p2.b, p1.b
```
After:
```
facgt p1.d, p0/z, z4.d, z1.d
```
ref: https://reviews.llvm.org/D129851
Reviewed By:sdesmalen,paulwalker-arm
Differential Revision:https://reviews.llvm.org/D141469
Introduce a parameter in getFallThrough() to optionally
allow returning the fall through basic block in spite of
an explicit branch instruction to it. This parameter is
set to false by default.
Introduce getLogicalFallThrough() which calls
getFallThrough(false) to obtain the block while avoiding
insertion of a jump instruction to its immediate successor.
This patch also reverts the changes made by D134557 and
solves the case where a jump is inserted after another jump
(branch-relax-no-terminators.mir).
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D140790
Let Propeller use specialized IDs for basic blocks, instead of MBB number.
This allows optimizations not just prior to asm-printer, but throughout the entire codegen.
This patch only implements the functionality under the new `LLVM_BB_ADDR_MAP` version, but the old version is still being used. A later patch will change the used version.
####Background
Today Propeller uses machine basic block (MBB) numbers, which already exist, to map native assembly to machine IR. This is done as follows.
- Basic block addresses are captured and dumped into the `LLVM_BB_ADDR_MAP` section just before the AsmPrinter pass which writes out object files. This ensures that we have a mapping that is close to assembly.
- Profiling mapping works by taking a virtual address of an instruction and looking up the `LLVM_BB_ADDR_MAP` section to find the MBB number it corresponds to.
- While this works well today, we need to do better when we scale Propeller to target other Machine IR optimizations like spill code optimization. Register allocation happens earlier in the Machine IR pipeline and we need an annotation mechanism that is valid at that point.
- The current scheme will not work in this scenario because the MBB number of a particular basic block is not fixed and changes over the course of codegen (via renumbering, adding, and removing the basic blocks).
- In other words, the volatile MBB numbers do not provide a one-to-one correspondence throughout the lifetime of Machine IR. Profile annotation using MBB numbers is restricted to a fixed point; only valid at the exact point where it was dumped.
- Further, the object file can only be dumped before AsmPrinter and cannot be dumped at an arbitrary point in the Machine IR pass pipeline. Hence, MBB numbers are not suitable and we need something else.
####Solution
We propose using fixed unique incremental MBB IDs for basic blocks instead of volatile MBB numbers. These IDs are assigned upon the creation of machine basic blocks. We modify `MachineFunction::CreateMachineBasicBlock` to assign the fixed ID to every newly created basic block. It assigns `MachineFunction::NextMBBID` to the MBB ID and then increments it, which ensures having unique IDs.
To ensure correct profile attribution, multiple equivalent compilations must generate the same Propeller IDs. This is guaranteed as long as the MachineFunction passes run in the same order. Since the `NextBBID` variable is scoped to `MachineFunction`, interleaving of codegen for different functions won't cause any inconsistencies.
The new encoding is generated under the new version number 2 and we keep backward-compatibility with older versions.
####Impact on Size of the `LLVM_BB_ADDR_MAP` Section
Emitting the Propeller ID results in a 23% increase in the size of the `LLVM_BB_ADDR_MAP` section for the clang binary.
Reviewed By: tmsriram
Differential Revision: https://reviews.llvm.org/D100808
The dependency was due to the log format. This change switches to the
previously-introduced (D139370) "dependency-free" logger instead of the
protobuf-based one.
A subsequent change will clean out the unnecessary abstraction left
behind.
This change drops the logger unittest, we have sufficient test coverage
via lit tests, and a unit test would require adding, unnecesarily, a log
reader (the reader is expected to be python, for the ML side, and there
is a reader for that under Analysis/models, used for tests).
Differential Revision: https://reviews.llvm.org/D141720
There's a conflict between the riscv32 and riscv64 output for some
tests which caused the script to drop the check lines.
Add specific check prefixes for these cases.
According to https://developer.arm.com/documentation/102105/ia-00/?lang=en
> Arm is making a retrospective change to the SVE architecture to remove
> the capability of selecting a non-power-of-two vector length in
> non-Streaming SVE as well as in Streaming SVE mode. Specific updates as
> a result of this change will be communicated in due course.
This patch implements the isVScaleKnownToBeAPowerOfTwo method to teach
DAG Combines that VScale will be known to be a power of 2, which helps
reduce or simplify some expressions (notably the udiv in vector trip
count expressions).
Differential Revision: https://reviews.llvm.org/D141486
Add an extra command line option to `llc` that allows checking at what cycle an instruction has been scheduled by the machine scheduler.
Differential Revision: https://reviews.llvm.org/D141289
This fix is similar to D124325, and I find the DestructiveBinaryComm
operation type also may be allocated same register, so insert the LSL.
movprfx z0.s, p0/z, z0.s
lsl z0.b, p0/m, z0.b, #0
fmul z0.s, p0/m, z0.s, z0.s
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D141471
The existing lowering of i1 vector shuffle was only considering
single-source shuffles, always assuming the second was undef. This
extends that to properly handle both operands.