The X86FixupLEAs pass drops blockaddress offsets when splitting up slow
3-ops LEAs, as can be seen in this example:
https://godbolt.org/z/bEsc3Poje
Before running the pass, the first instruction in bb.0 is a LEA with
ebp, ebx and a blockaddress.
After the transformation, the blockaddress is missing.
This happens because the 3-ops LEA is split up into a
2-ops LEA + an add instruction.
However, as hasLEAOffset does not take blockaddresses into
consideration, the add is not emitted, and the offset is
dropped.
Taking blockaddresses into consideration fixes this issue and results in
the add instruction being emitted.
This fixes #71667
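The failure mode described above can be illustrated with a minimal Python sketch. It is a toy model, not the pass itself: the `Lea` and `BlockAddress` records, `has_offset_buggy`, `has_offset_fixed`, and `split_lea` are hypothetical names introduced for illustration, and the offset check is deliberately simplified to immediates and blockaddresses only.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class BlockAddress:
    """Hypothetical stand-in for a symbolic blockaddress displacement."""
    label: str


Offset = Union[int, BlockAddress]


@dataclass(frozen=True)
class Lea:
    """Toy 3-operand LEA: dst = base + index + offset."""
    base: str
    index: str
    offset: Offset


def has_offset_buggy(off: Offset) -> bool:
    # Simplified model of the old check: only non-zero immediates count.
    return isinstance(off, int) and off != 0


def has_offset_fixed(off: Offset) -> bool:
    # Simplified model of the fix: blockaddresses count as an offset too.
    return isinstance(off, BlockAddress) or has_offset_buggy(off)


def split_lea(lea: Lea, has_offset: Callable[[Offset], bool]) -> List[str]:
    """Split a 3-operand LEA into a 2-operand LEA plus an optional add."""
    out = [f"lea dst, [{lea.base} + {lea.index}]"]
    if has_offset(lea.offset):
        off = lea.offset
        disp = off.label if isinstance(off, BlockAddress) else str(off)
        out.append(f"add dst, {disp}")
    return out
```

With the simplified old check, a `Lea("ebp", "ebx", BlockAddress(".Ltmp0"))` is split into a lone `lea` and the displacement silently disappears; with the fixed check, the trailing `add` is emitted.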
When generating the assembly code for AIX/XCOFF, the .file pseudo-op
needs to be emitted first, before any csects are generated. Otherwise,
information such as the embedded command line will be associated with
part of the object file rather than the entire object file.
We can match this directly in isel with the i32 type being legal.
The generic DAG combine will unpromote part of the pattern and
prevent it from being matched in isel.
The WebKit Calling Convention was created specifically for the WebKit
FTL. FTL doesn't use LLVM anymore and therefore this calling convention is
obsolete.
This commit removes the WebKit CC, its associated tests, and
documentation.
Extend our PRE logic to cover non-immediate AVL values. This covers
large constant AVLs (which must be materialized in registers), and may
help some code written explicitly with intrinsics.
Looking at the existing code, I can't entirely figure out why I thought
we needed VL == AVL to perform the PRE. My best guess is that I was
worried about the VLMAX < VL < 2 * VLMAX case, but the spec explicitly
says that vsetvli must be deterministic on any particular AVL value.
That case was, possibly by accident, covering another legality
precondition. Specifically, by only returning true for immediate and
VLMAX AVL values, we didn't encounter the case where the AVL was a
register and that register wasn't available in the predecessor (e.g. if
AVL is a load in the MBB block itself).
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
1. Map R16-R31 to DWARF registers 130-145.
2. Make R16-R31 caller-saved registers.
3. Make R16-R31 allocatable only when feature EGPR is supported.
4. Make R16-R31 available for instructions in legacy maps 0/1 and EVEX
space, except XSAVE*/XRSTOR.
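The DWARF numbering in item 1 is a contiguous range, so it can be written as a one-line mapping. The helper below is a hypothetical illustration (the name `egpr_dwarf_number` does not come from the patch); only the 130-145 range for R16-R31 is taken from the list above.

```python
def egpr_dwarf_number(reg: int) -> int:
    """Return the DWARF register number for APX register R<reg>, 16 <= reg <= 31.

    Hypothetical helper: R16 maps to 130 and each following register
    takes the next number, ending at 145 for R31.
    """
    if not 16 <= reg <= 31:
        raise ValueError(f"R{reg} is not an extended GPR (expected R16-R31)")
    return 130 + (reg - 16)
```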
RFC:
https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4
Explanations for some seemingly unrelated changes:
inline-asm-registers.mir, statepoint-invoke-ra-enter-at-end.mir:
The immediate (TargetInstrInfo.cpp:1612) used for the regdef/reguse is
the encoding for the register class in the enum generated by tablegen.
This encoding will change any time a new register class is added. Since
the number is part of the input, this means it can become stale.
seh-directive-errors.s:
R16-R31 make ".seh_pushreg 17" legal
musttail-varargs.ll:
It seems some LLVM passes use the number of registers rather than the number
of allocatable registers as a heuristic.
This PR relands #67702 after #70222 in order to reduce the
compile-time regression when EGPR is not used.
This patch enhances the optimization of memcmp calls when only two
outcomes are needed and the comparison fits into one block, for example:
bool result = memcmp(a, b, 6) > 0;
Previously, LLVM would generate unnecessary operations even when the
user of memcmp was only interested in a binary outcome.
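As a rough illustration of why a binary outcome is cheaper, the sketch below relies on a general property of memcmp: its sign matches an unsigned comparison of the two buffers read as big-endian integers, so `memcmp(a, b, n) > 0` collapses to a single comparison. This is only the underlying equivalence, not necessarily the exact sequence the patch emits, and both function names are hypothetical. It assumes both buffers hold at least `n` bytes.

```python
def memcmp(a: bytes, b: bytes, n: int) -> int:
    """Reference three-way compare over the first n bytes (C memcmp semantics)."""
    for x, y in zip(a[:n], b[:n]):
        if x != y:
            return x - y
    return 0


def memcmp_gt_zero(a: bytes, b: bytes, n: int) -> bool:
    """Binary outcome only: one unsigned compare of the big-endian values."""
    return int.from_bytes(a[:n], "big") > int.from_bytes(b[:n], "big")
```

The three-way result is never materialized in the second form, which is the kind of saving available when the caller only tests `> 0`.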
Function parameters marked with inreg are supposed to be allocated to
SGPRs. However, for compute functions, this is ignored and function
parameters are allocated to VGPRs. This fix modifies CC_AMDGPU_Func in
AMDGPUCallingConv.td to use SGPRs if the input argument is marked inreg.
---------
Co-authored-by: Jun Wang <jun.wang7@amd.com>
We can assume that the minimal SVE register is 128-bit when NEON is not
available, so we can lower a shuffle operation with one
operand to the TBL1 SVE instruction.
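A small model of the idea, assuming the usual table-lookup behaviour where each result element is the source element selected by the index and an out-of-range index yields zero. The function names are hypothetical. Because a one-operand shuffle mask only refers to lanes of the fixed-length input, every index stays within the low part of the register, which is why assuming a 128-bit minimum is enough for the inputs modelled here.

```python
from typing import List


def tbl1(src: List[int], idx: List[int]) -> List[int]:
    """Model of a single-table lookup: out[i] = src[idx[i]], or 0 if out of range.

    Indices are treated as unsigned, so negative values are not expected.
    """
    n = len(src)
    return [src[i] if i < n else 0 for i in idx]


def shuffle_one_operand(src: List[int], mask: List[int]) -> List[int]:
    """Reference semantics of a one-operand shuffle with an in-range mask."""
    return [src[m] for m in mask]
```

For any mask whose entries are valid lanes of `src`, `tbl1(src, mask)` and `shuffle_one_operand(src, mask)` agree, so the mask can be used directly as the index vector.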
- Revert "[DAGCombiner] Transform `(icmp eq/ne (and X,C0),(shift X,C1))`
to use rotate or to getter constants." - causes a miscompile, see
112e49b381 (commitcomment-131943923)
- Revert "[X86] Fix gcc warning about mix of enumeral and non-enumeral
types. NFC", which fixes a compiler warning in the commit above
Track the live register state immediately before, instead of after,
MBBI. This makes it simple to track the state at the start or end of a
basic block without a separate (and poorly named) Tracking flag.
This changes the API of the backward(MachineBasicBlock::iterator I)
method, which now recedes to the state just before, instead of just
after, *I. Some clients are simplified by this change.
There is one small functional change shown in the lit tests where
multiple spilled registers all need to be reloaded before the same
instruction. The reloads will now be inserted in the opposite order.
This should not affect correctness.
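The "state immediately before the instruction" convention can be sketched with the textbook backward liveness rule: the registers live before an instruction are those live after it, minus its defs, plus its uses. This is a generic model under that assumption, not the scavenger's actual data structures; `Instr`, `step_backward`, and `live_before_each` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Instr:
    defs: FrozenSet[str]
    uses: FrozenSet[str]


def step_backward(live_after: Set[str], instr: Instr) -> Set[str]:
    """Recede over one instruction: return the live set just before it."""
    return (live_after - instr.defs) | instr.uses


def live_before_each(block: List[Instr], live_out: Set[str]) -> List[Set[str]]:
    """states[i] is the live set immediately before block[i].

    states[len(block)] is the block's live-out set, so the start and the end
    of the block are both ordinary entries and need no separate flag.
    """
    states = [set(live_out)]
    live = set(live_out)
    for instr in reversed(block):
        live = step_backward(live, instr)
        states.append(live)
    states.reverse()
    return states
```

With this indexing, the state at the start of the block is `states[0]` and the state at the end is `states[-1]`, which mirrors the simplification described above.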
The landing of https://reviews.llvm.org/D88663 renders the existing
stp-opt-with-renaming-undef-assert test useless because the picked
register for renaming becomes q0 instead of q16.
Add patterns to select int_amdgcn_set_inactive_chain_arg to
V_SET_INACTIVE.
This could probably use some more testing, but at least for simple cases
V_SET_INACTIVE seems to mostly work out of the box.
Differential Revision: https://reviews.llvm.org/D158605
Initialize the SP to 0 in the prologue of functions with the
`amdgpu_cs_chain` or `amdgpu_cs_chain_preserve` calling conventions, but
only if they need one (i.e. if they contain calls to `amdgpu_gfx`
functions or if they have stack objects).
Also make sure we don't try to realign the stack (since 0 is aligned
enough).
Differential Revision: https://reviews.llvm.org/D156413
Teach prolog epilog insertion how to handle functions with the
amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions.
For amdgpu_cs_chain functions, we only need to preserve the inactive
lanes of VGPRs above v8, and only in the presence of calls via
@llvm.amdgcn.cs.chain.
For amdgpu_cs_chain_preserve functions, we will also need to preserve
the active lanes for registers above the last argument VGPR. AFAICT
there's no direct way to find out what the last argument VGPR is, so
instead the patch uses the fact that chain calls from
amdgpu_cs_chain_preserve functions can't use more VGPRs than the
caller's VGPR arguments. In other words, it removes the operands of
SI_CS_CHAIN_TC instructions from the list of callee saved registers.
For both calling conventions, registers v0-v7 never need to be saved and
restored, so we should never add them as WWM spills.
Differential Revision: https://reviews.llvm.org/D156412
After #71483 we now have a way of marking masked pseudos as having an
unmasked equivalent, but their mask shouldn't be folded unless it's all ones
since it would affect the result.
This patch uses it to mark the pseudos for vredsum and friends, which in
turn allows us to remove the unmasked patterns, and catch some other forms of
vmerge.
Move SIPreAllocateWWMRegs pass to just before VGPR allocation. This
saves recomputation of the virtual matrix and live reg map, with the
slight regression in O0 that live intervals and slot indexes must be
computed.
…gSizeInBits
This patch changes getRegSizeInBits to return a TypeSize instead of an
unsigned in the case that a virtual register has a scalable LLT. If the
register is physical, a fixed TypeSize is returned.
The MachineVerifier pass is updated to allow copies between fixed and
scalable operands as long as the Src size will fit into the Dest size.
This is a precommit which will be stacked on by a change to GISel to
generate COPYs with a scalable destination but a fixed size source.
This patch is stacked on https://github.com/llvm/llvm-project/pull/70893
for the ability to use scalable vector types in MIR tests.
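One way to read "the Src size will fit into the Dest size" is sketched below. This is an assumption-laden model, not the verifier's code: the `TypeSize` record here is a simplified stand-in with a known-minimum bit count and a scalable flag, `copy_size_ok` is a hypothetical name, and the handling of a scalable source with a fixed destination is a conservative guess rather than something the description above specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeSize:
    """Simplified size: min_bits, multiplied by vscale (>= 1) when scalable."""
    min_bits: int
    scalable: bool = False


def copy_size_ok(src: TypeSize, dst: TypeSize) -> bool:
    """Hypothetical size check for a COPY in this toy model."""
    if src.scalable == dst.scalable:
        # Same kind on both sides: require matching sizes.
        return src.min_bits == dst.min_bits
    if dst.scalable:
        # Fixed source, scalable destination: vscale >= 1, so the source
        # fits whenever it is no larger than the destination's minimum size.
        return src.min_bits <= dst.min_bits
    # Scalable source, fixed destination: the source has no compile-time
    # upper bound, so this sketch conservatively rejects it (assumption).
    return False
```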
As far as I can tell, there's nothing in this code which actually
assumes the two predicates in (FoundLHS FoundPred FoundRHS) => (LHS Pred
RHS) are the same.
Noticed while investigating something else; this is purely an
opportunistic optimization while I'm looking at the code. Unfortunately,
this doesn't solve my original problem. :)
I assumed indexes were always ConstantInts, but that's not always the
case. They can be other things as well. We can easily handle that by
just emitting an add and letting InstSimplify do the constant folding for
cases where it's really a ConstantInt.
Solves SWDEV-429935
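The "emit an add unconditionally, let a later simplification fold it" approach can be shown with a toy expression model. Nothing here is LLVM API: `emit_add` and `simplify` are hypothetical helpers, integers stand for constant indexes, and strings stand for values only known at run time.

```python
from typing import Tuple, Union

# int: constant index; str: name of a runtime value; tuple: ("add", lhs, rhs)
Expr = Union[int, str, Tuple[str, "Expr", "Expr"]]


def emit_add(lhs: Expr, rhs: Expr) -> Expr:
    """Always build an add node, without inspecting the operands."""
    return ("add", lhs, rhs)


def simplify(expr: Expr) -> Expr:
    """Fold an add when both operands turn out to be constants."""
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        lhs, rhs = simplify(lhs), simplify(rhs)
        if isinstance(lhs, int) and isinstance(rhs, int):
            return lhs + rhs
        return (op, lhs, rhs)
    return expr
```

The emitter stays free of special cases: constant operands fold away afterwards, and non-constant operands leave a real add behind.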
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.
- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
For more recent SVE-capable CPUs it is beneficial to use the inc*
instruction to increment a value by vscale (potentially shifted or
multiplied) even in short loops.
This patch tells codegenprepare to sink appropriate vscale calls into
blocks where they are used so that isel can match them.
The first PseudoVMERGE_VIM_M1 should use %pt1 as its operand instead of
%pt2.
I found this error when I added the LiveIntervals analysis pass in my
downstream, where it crashes with the message:
```
Use of %7 does not have a corresponding definition on every path:
112r %6:vrnov0 = PseudoVMERGE_VIM_M1 %pt2:vrnov0(tied-def 0), %2:vr, 1, %4:vmv0, 1, 3
LLVM ERROR: Use not jointly dominated by defs.
```
Remove support for the fptrunc, fpext, fptoui, fptosi, uitofp and sitofp
constant expressions. All places creating them have been removed
beforehand, so this just removes the APIs and uses of these constant
expressions in tests.
With this, the only remaining FP operation that still has constant
expression support is fcmp.
This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
This transformation might be illegal for `PseudoVIOTA_M`. The value of
`viota.m vd, vs2` is the prefix sum of vs2, and adding a mask to it may
produce a wrong prefix sum.
For example, the result of the following expression is `{5, 5, 5, 3}`,
```
; v4 = {1, 1, 1, 1}
viota.m v1, v4
; v0 = {0, 0, 0, 1}, v1 = {0, 1, 2, 3}, v8 = {5, 5, 5, 5}
vmerge.vvm v8, v8, v1, v0.t
; v8 = {5, 5, 5, 3}
```
but if we merge them to `viota.m v8, v4, v0.t`, then the result is
`{5, 5, 5, 0}`.
Also, we still do `performCombineVMergeAndVOps` for `viota.m` when
the mask of `vmerge.vvm` is a true mask.
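The example can be reproduced with a small Python model of the two instructions. It is a sketch under stated assumptions: masked-off destination elements keep their old value, only active elements contribute to the running count, and the function names `viota` and `vmerge_vvm` are illustrative helpers rather than anything from the patch.

```python
from typing import List, Optional


def viota(vs2: List[int], mask: Optional[List[int]] = None,
          vd: Optional[List[int]] = None) -> List[int]:
    """Model of viota.m: each active element gets the count of set bits in
    vs2 among the active elements before it. Inactive elements keep vd."""
    n = len(vs2)
    mask = [1] * n if mask is None else mask
    out = [0] * n if vd is None else list(vd)
    count = 0
    for i in range(n):
        if mask[i]:
            out[i] = count
            count += vs2[i]
    return out


def vmerge_vvm(vs2: List[int], vs1: List[int], v0: List[int]) -> List[int]:
    """Model of vmerge.vvm vd, vs2, vs1, v0: vd[i] = vs1[i] if v0[i] else vs2[i]."""
    return [b if m else a for a, b, m in zip(vs2, vs1, v0)]


v4 = [1, 1, 1, 1]
v0 = [0, 0, 0, 1]
v8 = [5, 5, 5, 5]

unmasked_then_merge = vmerge_vvm(v8, viota(v4), v0)   # [5, 5, 5, 3]
folded_masked_viota = viota(v4, mask=v0, vd=v8)       # [5, 5, 5, 0]
```

The two results differ in the last element because the masked form skips the first three set bits of `v4`. With an all-ones mask both forms produce the same vector, which matches the case where the combine is still performed.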
The special handling for blocks ending with a long branch has been
unnecessary since D106445:
"[amdgpu] Add 64-bit PC support when expanding unconditional branches."